Teach-to-Reason: Competition-Guided Reasoning with a Self-Improving Teacher

Hao Liu; Hui Guo; Jile Jiao; Xiaofeng Mou; Xiao Han; Yi Xu; Yue Wang; Zhimin Bao

arxiv: 2606.25407 · v1 · pith:YV6MKVRVnew · submitted 2026-06-24 · 💻 cs.CV

Pith reviewed 2026-06-25 20:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords chest x-ray · visual question answering · chain-of-thought · reinforcement learning · comparison-based supervision · self-improving teacher · medical reasoning

The pith

A self-improving Teacher supplies comparison-based references that strengthen chain-of-thought reasoning in chest X-ray visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Teach-to-Reason, a framework that pairs a Teacher model strengthened through internal competition with a Reasoner that learns from those progressively harder references. Existing reinforcement learning for this medical task uses only final-answer correctness as reward, which often gives no useful gradient once all sampled answers in a group score the same. The method adds a case-wise reward that keeps the original positive-negative split when it remains informative and switches to competition scores when the answer reward vanishes. On several chest X-ray open-ended VQA benchmarks the combined signal produces higher reasoning quality than standard answer-only training. A reader would care because medical applications need trustworthy step-by-step explanations, not merely correct yes-no or short answers.
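
To make the vanishing-signal problem concrete, the sketch below computes group-normalised advantages of the kind used in group-relative policy optimisation. The normalisation formula, the function name `group_advantages`, and the toy reward values are assumptions for illustration; the abstract states only that group-level advantages can collapse to zero.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages: (r_i - mean) / (std + eps).

    Assumed form for illustration; the paper's exact estimator is not
    given in the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where the answer reward separates good from bad samples.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Groups where every sampled answer scores the same.
all_right = group_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)      # positive for correct samples, negative for wrong ones
print(all_right)  # every advantage is 0.0, so the policy gradient vanishes
print(all_wrong)  # same collapse when every sample is wrong
```

Both uniform groups yield all-zero advantages regardless of how good the individual reasoning traces are, which is the situation the case-wise reward is meant to handle.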

Core claim

Teach-to-Reason integrates comparison-based supervision into chain-of-thought optimization by maintaining a self-improving Teacher that generates reference answers through repeated self-competition; the Reasoner is then trained against these references using a case-wise reward that preserves the original reward-induced partition when informative and restores supervision from competition scores when group-level advantages collapse to zero.
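
A minimal sketch of how such a case-wise switch might look, assuming the fallback triggers whenever all answer rewards in a group are equal. The function names, the tolerance `tol`, and the choice to leave answer rewards untouched in the informative case are hypothetical; the abstract does not specify the actual rule.

```python
from statistics import mean, pstdev


def normalise(values, eps=1e-6):
    """Centre and scale a list of scores within one sampled group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def case_wise_advantages(answer_rewards, competition_scores, tol=1e-8):
    """Hypothetical case-wise rule (an assumption, not the paper's formula).

    Case 1: answer rewards differ within the group, so the original
            positive/negative partition is informative and is kept as is.
    Case 2: answer rewards are uniform, so advantages would all be zero;
            supervision is taken from the competition scores instead.
    """
    if len(answer_rewards) != len(competition_scores):
        raise ValueError("one competition score is needed per sample")
    if max(answer_rewards) - min(answer_rewards) > tol:
        return normalise(answer_rewards), "answer"
    return normalise(competition_scores), "competition"


# Every sampled answer is correct, so only the comparison against the
# Teacher reference can distinguish the reasoning traces.
answer_rewards = [1.0, 1.0, 1.0, 1.0]
competition_scores = [0.9, 0.2, 0.4, 0.6]
advantages, source = case_wise_advantages(answer_rewards, competition_scores)
print(source, advantages)
```

Under this sketch, a group whose competition scores are also uniform would still produce zero advantages; whether the paper handles that case separately is not stated in the abstract.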

What carries the argument

Teach-to-Reason framework: a self-improving Teacher that generates references via self-competition and a competition-guided Reasoner trained with case-wise rewards that blend answer correctness and comparison scores.

If this is right

The Reasoner produces higher-quality chain-of-thought traces than answer-level reinforcement learning alone.
Performance gains appear consistently across multiple chest X-ray open-ended VQA benchmarks.
Supervision remains available even after group-level answer advantages reach zero.
The Teacher improves iteratively without external labeled reasoning data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same competition-plus-case-wise pattern could be tested on non-medical visual question answering tasks where reasoning quality matters.
If the Teacher's self-competition proves stable, the approach reduces dependence on human-written reasoning chains for training.
The method highlights a general way to keep reinforcement learning gradients alive when outcome rewards become uniform.

Load-bearing premise

The Teacher can be strengthened repeatedly through its own competition outcomes and the case-wise rule will reliably supply a useful training signal precisely when answer rewards become uninformative.
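
One way to picture the self-competition step is a round-robin among Teacher candidates scored by a pairwise judge, with the top-ranked trace used as the reference. Everything in the sketch below is an assumption made for illustration: the round-robin format, the position swap, the names `round_robin` and `pick_reference`, and the toy judge backed by a hard-coded quality table. The abstract does not describe how the competition is run.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# A judge takes two reasoning traces and returns "A" or "B".
Judge = Callable[[str, str], str]


def round_robin(candidates: List[str], judge: Judge) -> List[float]:
    """Win rate of each candidate over all pairings, judged in both orders."""
    wins = [0] * len(candidates)
    games = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # Judge each pair twice with positions swapped to reduce order bias.
        for a, b in ((i, j), (j, i)):
            verdict = judge(candidates[a], candidates[b])
            winner = a if verdict == "A" else b
            wins[winner] += 1
            games[a] += 1
            games[b] += 1
    return [w / g if g else 0.0 for w, g in zip(wins, games)]


def pick_reference(candidates: List[str], judge: Judge) -> Tuple[str, List[float]]:
    """Return the candidate with the highest win rate plus all win rates."""
    scores = round_robin(candidates, judge)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores


# Stand-in judge: looks up a made-up quality score instead of calling a model.
quality: Dict[str, float] = {"trace-1": 0.2, "trace-2": 0.9, "trace-3": 0.5}


def toy_judge(first: str, second: str) -> str:
    return "A" if quality[first] >= quality[second] else "B"


reference, win_rates = pick_reference(list(quality), toy_judge)
print(reference, win_rates)  # trace-2 [0.0, 1.0, 0.5]
```

The premise above is sensitive to the judge: if the judge rewards a superficial property, repeated competition would amplify it rather than strengthen the Teacher.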

What would settle it

Running the same CXR VQA benchmarks with T2R and measuring no gain in chain-of-thought quality metrics over plain answer-reward reinforcement learning would falsify the central claim.
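
A hedged sketch of how "no gain" could be checked on paired data: an exact one-sided sign test over per-question chain-of-thought quality scores for T2R and an answer-reward baseline. The scores, the function name `sign_test_p`, and the choice of test are assumptions; the abstract does not say which quality metric or significance test the paper uses.

```python
from math import comb


def sign_test_p(t2r_scores, baseline_scores):
    """One-sided exact sign test on paired scores.

    Returns P(wins >= observed) under the null that T2R is equally likely
    to win or lose on each question. Ties are dropped.
    """
    if len(t2r_scores) != len(baseline_scores):
        raise ValueError("scores must be paired per question")
    wins = sum(a > b for a, b in zip(t2r_scores, baseline_scores))
    losses = sum(a < b for a, b in zip(t2r_scores, baseline_scores))
    n = wins + losses
    if n == 0:
        return 1.0  # no question separates the two systems
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Made-up per-question quality scores, for illustration only.
print(sign_test_p([3, 4, 5, 4], [2, 3, 4, 3]))  # 0.0625
print(sign_test_p([1, 3], [2, 2]))              # 0.75
```

A large p-value across the benchmarks would be consistent with the falsifying outcome described above, although a real evaluation would need far more than a handful of questions.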

Original abstract

Chest X-ray visual question answering (CXR VQA) requires models not only to predict correct answers, but also to produce reliable medical reasoning. However, existing reinforcement-learning-based training typically relies on answer-level rewards, which are often too coarse to improve chain-of-thought (CoT) quality and can become ineffective when group-level advantages collapse to zero. We propose Teach-to-Reason (T2R), a framework that introduces comparison-based supervision into CoT optimization through a self-improving Teacher and a competition-guided Reasoner. As the Teacher is iteratively strengthened via self-competition, the Reasoner is optimized against progressively stronger Teacher-generated references. We further introduce a case-wise reward design that preserves the original reward-induced positive/negative partition when it is informative, and restores supervision from competition scores when the original reward signal degenerates. Experiments on multiple CXR open-ended VQA benchmarks show that T2R consistently outperforms strong baselines, indicating that comparison-based supervision, when integrated in a controlled and principled manner, provides a more effective training signal for reasoning optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

T2R proposes a case-wise reward switch plus self-competition to keep RL training alive when group advantages collapse in medical VQA, but the empirical support is still unexamined.

The main thing to know is that this paper introduces Teach-to-Reason to handle a real failure mode in RL-based CoT training for CXR VQA: when answer-level rewards produce zero group advantages, supervision disappears. The fix combines a self-improving teacher that gets stronger through internal competition with a reasoner trained against those references, plus a case-wise reward that keeps the original positive/negative split when it is still informative and falls back to competition scores when it is not.

The framework is new in the controlled way it layers comparison-based supervision on top of existing RL without discarding the original signal. The description of the reward switch is clear and internally consistent, and it directly targets the coarseness problem the authors name.

The experiments are reported to show consistent gains over strong baselines on multiple open-ended benchmarks, but the abstract supplies no numbers, no ablation on the switch itself, and no detail on how the teacher is initialized or prevented from drifting. That leaves the size and reliability of the improvement impossible to judge from what is here.

The paper is aimed at researchers working on RL for visual reasoning or medical VQA who have run into vanishing advantages. A reader looking for concrete reward-design ideas would find the case-wise mechanism worth testing.

It deserves peer review because the limitation it addresses is common and the proposed mechanism is specific enough to evaluate.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces Teach-to-Reason (T2R), a framework for chest X-ray visual question answering (CXR VQA) that augments reinforcement-learning-based chain-of-thought optimization. It employs a self-improving Teacher, strengthened via self-competition, to generate progressively stronger reference answers; a competition-guided Reasoner trained against those references; and a case-wise reward that retains the original reward-induced positive/negative partition when it is informative but switches to competition scores when group-level advantages collapse to zero. Experiments on multiple CXR open-ended VQA benchmarks are reported to show consistent outperformance over strong baselines, supporting the claim that controlled comparison-based supervision yields a more effective training signal for reasoning.

Significance. If the empirical claims hold, the work addresses a recognized limitation of answer-level rewards in RL for reasoning tasks and demonstrates a mechanism for restoring supervision without introducing free parameters. The self-competition loop and case-wise switching rule constitute a principled integration of comparison signals; credit is due for the explicit design that aims to avoid collapse while preserving original partitions when they remain informative. The approach could influence subsequent work on medical VQA and broader CoT optimization if the iterative Teacher improvement is shown to be stable.

minor comments (2)

[Abstract] The specific CXR VQA benchmarks and the magnitude of the reported gains (e.g., accuracy deltas or statistical significance) are not named; adding one sentence would improve immediate readability.
[Method] The case-wise reward rule is described at a high level; a short pseudocode block or explicit condition (e.g., “if advantage = 0 then …”) in the methods section would clarify the switching logic for readers implementing the method.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed and positive summary of our manuscript, the recognition of the significance of the case-wise reward design and self-competition mechanism, and the recommendation for minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents T2R as a framework that integrates comparison-based supervision through a self-improving Teacher and a case-wise reward switch. No equations, parameter fits, or derivations are shown that reduce the claimed improvement to a self-definition, a fitted input renamed as prediction, or a self-citation chain. The method description and experimental claims on external CXR VQA benchmarks remain independent of the inputs; the central claim does not collapse by construction to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5744 in / 1037 out tokens · 40737 ms · 2026-06-25T20:57:44.676378+00:00 · methodology

Appendix A: LLM-as-a-Judge Details

Unless otherwise specified, we use Qwen3-4B-Instruct-2507 as the judge model and deploy it with vLLM, using max_model_len=32768, gpu_mem_util=0.9, max_tokens=2048, temperature=0.7, and top_p=0.9. Across all judging tasks, th...

Correctness and consistency
* Does the reasoning logically support the official answer {answer}, instead of drifting toward a different answer?
* Is it broadly consistent with the radiology report in terms of imaging findings and diagnostic direction?
* Does it avoid obvious medical errors or reasoning that conflicts with basic chest imaging knowledge?

Use of evidence and reasoning chain
* Does it clearly build on plausible key findings that one could see on the chest X-ray, rather than just restating the answer or giving vague comments?
* Does it show a reasonable reasoning flow, for example: “inspect the image → describe key abnormalities/locations/extent → relate to the question → rule out less likely options → reach the conclusion”...

Information use and restraint
* Does it focus on the findings and information that are actually relevant to this specific question, instead of introducing large amounts of unrelated content?
* Does it avoid clearly inventing findings or test results that are not supported by the image or the case?
* If it implicitly aligns with ideas present in the radiol...

Clarity of expression
* Is the language clear and easy to understand?
* Does it clearly explain why the official answer makes sense, rather than merely restating the conclusion?
* Is it concise but effective, without losing focus through excessive expansion?

【Special instructions for comparison】
* This is a quality comparison, not a length comparison;
* L...

【Output requirements】
Internally compare reasoning A and B using the above criteria, and decide which one is overall better
Then output your decision and justification in XML format, with root tag <response> and two child tags:
* <reason>: briefly explain how you compared A and B, in which aspects A is better or B is better, and why you finally chose one over the other.
* <result>: write only a single capital letter, “A” or “B”, indicating which reasoning you judge to be bette...

Chest X-ray report (Report)
- This is the factual basis for the case.
- If the reference answer or the student answer conflicts with the report, the report has the highest priority for judging medical correctness

Reference answer (Ground Truth)
- Provided by the instructor; it reflects the key information that the question is intended to test.
- You should extract the core medical elements from it (e.g. abnormal findings, diagnosis, location, cause, extent, severity, management advice)

Student answer (Pred)
- The wording may differ from the reference answer.
- As long as the medical meaning is equivalent or very close, and it does not contradict the report, it can be graded as correct.

【Grading principles】
You must decide a binary result (yes/no) based on the following:

Focus on what the question is asking for
- First, understand what type of information the question explicitly asks:
  - e.g. “most likely diagnosis”, “main abnormal finding”, “location of the lesion”, “possible cause”, “severity”, “management step”, etc.
- The reference answer shows what the instructor really wants the student to provide

Identify the core elements in the reference answer
- The reference answer may contain one or several key points:
  - If the question asks for “main diagnosis / main abnormality / most important change”, the student answer should at least cover that main core element; minor omissions may be acceptable.
  - If the question explicitly asks for “all major abnorma...

When to mark the student as correct (result = "yes")
- The student’s answer matches the core meaning of the reference answer:
  - Synonyms, paraphrases, and equivalent medical terminology are acceptable;
  - Shorter wording is acceptable if it still captures the key medical content.
- The student answer must NOT:
  - Present a diagnosis/finding/location/cause t...

When to mark the student as incorrect (result = "no")
- The student answer misses the key information required by the question;
- The main content of the student answer conflicts with the reference answer and/or the report;
- The answer is too general or ambiguous and does not demonstrate actual understanding of the required specific point;
- The answer c...

Leniency vs strictness
- Do not require exact wording; judge based on medical meaning.
- Be tolerant of minor phrasing differences that do not affect correctness.
- Be strict regarding the main diagnostic direction, key location, main abnormality type, key cause, and other crucial elements.

【Output format】
First complete your reasoning internally; do NOT output your intermediate thoughts
Then output exactly one XML code block, wrapped in ```xml, with the following structure: ```xml <response> <reason>Use 2–5 sentences in English to briefly explain why you judged the student answer as correct or incorrect. Mention what the question asks for, what the core elements of the reference answer are, and whether the student answer matches them or ...
Do NOT output anything outside this XML code block. No extra explanations, no additional code blocks.
---
【Chest X-ray report】 {report}
---
【Open-ended question】 {question}
---
【Reference (ground truth) answer】 {ground_truth}
---
【Student answer】 {pred}
---
Based on the above information and grading principles, decide whether the student’s answer should b...

Consistency with the official answer
- Does the reasoning ultimately support the given official answer {answer} (at least not contradict it in meaning)?
- Does it clearly explain why this answer is reasonable, rather than implicitly suggesting that some other answer would be more appropriate?
[69]

Medical consistency with the case/report - Do the imaging findings, abnormalities, diagnostic tendencies, etc. mentioned in the reasoning broadly match the radiology report in terms of direction and key facts? - Does it avoid conclusions that are clearly opposite to the report or strongly violate basic chest imaging knowledge (e.g., treating obviously nor...
[70]

inspect the image → describe key findings → analyze in light of the question → rule out less likely possibilities → arrive at a conclusion that matches the official answer

Completeness and plausibility of the reasoning chain - Does it present a genuine reasoning process, rather than just a bare conclusion or a single sentence explanation? - Does it cover most of the following elements: “inspect the image → describe key findings → analyze in light of the question → rule out less likely possibilities → arrive at a conclusion ...
[71]

the report states…

Proper use of information - Does it mainly rely on findings that could reasonably be observed on the chest X-ray and on the information in the question, rather than inventing non-existent tests or imaging features? - Does it avoid obviously copying phrases from the radiology report, or explicitly revealing that it saw the report/answer (e.g., “the report ...
[72]

Clarity of explanation - Is the language clear enough that someone with basic medical background can understand what the reasoning is saying? - Does it stay focused on what this particular question is asking, without long digressions into irrelevant content?
[73]

acceptable

Language quality and readability - Is the text reasonably fluent, with basically correct grammar, and without so many typos, nonsense words, or scrambled word order that it becomes hard to understand? - Check whether the main language of the reasoning is English: - It may include some non-English technical terms or abbreviations, but the majority of the t...
[74]

First, complete your internal evaluation, then provide your judgment and justification in XML format
[75]

inspect the image(s) → describe key findings → analyze these findings in the context of the question → reach a conclusion

Use <response> as the root tag, with two child tags: - <reason>: briefly explain, in natural English, why you consider the reasoning acceptable or not acceptable. You may mention both strengths and major flaws. - <result>: write yes or no, indicating your final judgment on the reasoning trace. Important: - Do NOT output anything outside the XML structure....
[76]

**Localized density**: The mass is homogenous and uniformly dense (appearing white) with no air bronchograms or cavitation—this rules out pneumonic (air-filled) or embolic (gas-filled) lesions
[77]

**Well-circumscribed contour**: The mass has a sharp, defined border against the adjacent lung parenchyma, a hallmark of primary malignancy rather than infectious or inflammatory processes
[78]

**No surrounding vascular or pleural changes**: There is no blunting of the costophrenic angle, no pleural effusion, no mediastinal shift, and no interstitial thickening—features absent in pneumonia (which typically shows bronchial air bronchograms and consolidation), pulmonary embolism (which shows vascular calcification or ground-glass opacities), or tu...
[79]

**Absence of cavitation or cavitation-like structure**: The mass is solid and non-cavitating—consistent with a primary neoplasm, not a tuberculous lesion that may evolve into cavitation. Now analyzing the clinical history (though not fully described in the prompt), the presence of a **right upper lobe mass** in a patient with no signs of systemic infectio...

[1] Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 5(1):180251, 2018.

[2] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021.

[3] Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei JI, Eric Chang, Tackeun Kim, et al. Mimic-ext-mimic-cxr-vqa: a complex, diverse, and large-scale visual question answering dataset for chest x-ray images. PhysioNet, 2024.

[4] Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, and Pranav Rajpurkar. Rexvqa: A large-scale visual question answering benchmark for generalist chest x-ray understanding. In Biocomputing 2026: Proceedings of the Pacific Symposium, pages 251–264. World Scientific, 2025.

[5] Xiaotang Gai, Chenyi Zhou, Jiaxiang Liu, Yang Feng, Jian Wu, and Zuozhu Liu. Medthink: A rationale-guided framework for explaining medical visual question answering. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7438–7450, 2025.

[7] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

[8] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023.

[9] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[10] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025.

[11] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.

[12] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

[13] Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025.

[14] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024.

[15] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.

[16] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. Self-rewarding language models. In Forty-first International Conference on Machine Learning, 2024.

[17] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.

[18] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in neural information processing systems, 36:46534–46594, 2023.

[19] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems, 36:8634–8652, 2023.

[20] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

[21] Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. In Proceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019, 2019.

[22] Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.

[23] Yash Khare, Viraj Bagal, Minesh Mathew, Adithi Devi, U Deva Priyakumar, and CV Jawahar. Mmbert: Multimodal bert pretraining for improved medical vqa. In 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1033–1036. IEEE, 2021.

[24] Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, and Tsung-Hui Chang. Multi-modal masked autoencoders for medical vision-and-language pre-training. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 679–689. Springer, 2022.

[25] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023.

[26] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564, 2023.

[27] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine learning for health (ML4H), pages 353–367. PMLR, 2023.

[28] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025.

[29] Zhihong Lin, Donghao Zhang, Qingyi Tao, Danli Shi, Gholamreza Haffari, Qi Wu, Mingguang He, and Zongyuan Ge. Medical visual question answering: A survey. Artificial Intelligence in Medicine, 143:102611, 2023.

[30] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.

[31] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The twelfth international conference on learning representations, 2023.

[32] Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025.

[33] Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025.

[34] Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G Finlayson, David Sontag, et al. Dr tulu: Reinforcement learning with evolving rubrics for deep research. arXiv preprint arXiv:2511.19399, 2025.

[35] Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, and Lifeng Jin. Chasing the tail: Effective rubric-based reward modeling for large language model post-training. arXiv preprint arXiv:2509.21500, 2025.

[36] Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, et al. Breaking the exploration bottleneck: Rubric-scaffolded reinforcement learning for general llm reasoning. arXiv preprint arXiv:2508.16949, 2025.

[37] Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.

[38] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.

[39] Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, and Dacheng Tao. Serl: Self-play reinforcement learning for large language models with limited data. arXiv preprint arXiv:2505.20347, 2025.

[40] Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reasoning with zero data. arXiv preprint arXiv:2505.03335, 2025.

[41] Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning. arXiv preprint arXiv:2510.24684, 2025.

[42] Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Jiaqi Guo, Haotian Xu, Chutian Wang, Haonan Chen, Xiaoxi Jiang, and Guanjun Jiang. Search self-play: Pushing the frontier of agent capability without supervision. arXiv preprint arXiv:2510.18821, 2025.

[43] Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, and Hongyuan Zhan. The alignment waltz: Jointly training agents to collaborate for safety. arXiv preprint arXiv:2510.08240, 2025.

[44] Yicheng He, Chengsong Huang, Zongxia Li, Jiaxin Huang, and Yonghui Yang. Visplay: Self-evolving vision-language models from images. arXiv preprint arXiv:2511.15661, 2025.

[45] Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence. arXiv preprint arXiv:2507.21046, 1, 2025.

[46] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[47] Xinyue Hu, Lin Gu, Kazuma Kobayashi, X Hu, L Gu, K Kobayashi, L Liu, M Zhang, T Harada, RM Summers, et al. Medical-cxr-vqa dataset: A large-scale llm-enhanced medical dataset for visual question answering on chest x-ray images, 2025.

[48] Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei Ji, Eric Chang, Tackeun Kim, et al. Ehrxqa: A multi-modal question answering dataset for electronic health records with chest x-ray images. Advances in Neural Information Processing Systems, 36:3867–3880, 2023.

[49] Muhammad EH Chowdhury, Tawsifur Rahman, Amith Khandakar, Rashid Mazhar, Muhammad Abdul Kadir, Zaid Bin Mahbub, Khandakar Reajul Islam, Muhammad Salman Khan, Atif Iqbal, Nasser Al Emadi, et al. Can ai help in screening viral and covid-19 pneumonia? IEEE Access, 8:132665–132676, 2020.

[50] A Asraf and Z Islam. Covid19 pneumonia and normal chest x-ray pa dataset. mendeley data v1 (2021), 2021.

A LLM-as-a-Judge Details

Unless otherwise specified, we use Qwen3-4B-Instruct-2507 as the judge model and deploy it with vLLM, using max_model_len=32768, gpu_mem_util=0.9, max_tokens=2048, temperature=0.7, and top_p=0.9. Across all judging tasks, th...
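The judge settings quoted above can be grouped as in the following minimal sketch. The class and function names are hypothetical, and mapping `gpu_mem_util` to vLLM's `gpu_memory_utilization` argument is an assumption; the numeric values are the ones stated in the text. The sketch only builds plain dictionaries, so it runs without vLLM installed.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class JudgeConfig:
    """Judge settings quoted in the appendix; field names are illustrative."""
    model: str = "Qwen3-4B-Instruct-2507"  # name as written in the text
    max_model_len: int = 32768
    gpu_memory_utilization: float = 0.9  # written as gpu_mem_util in the text
    max_tokens: int = 2048
    temperature: float = 0.7
    top_p: float = 0.9


ENGINE_KEYS = ("model", "max_model_len", "gpu_memory_utilization")
SAMPLING_KEYS = ("max_tokens", "temperature", "top_p")


def split_config(cfg):
    """Separate engine-level options from per-request sampling options."""
    values = asdict(cfg)
    engine = {key: values[key] for key in ENGINE_KEYS}
    sampling = {key: values[key] for key in SAMPLING_KEYS}
    return engine, sampling


engine_kwargs, sampling_kwargs = split_config(JudgeConfig())
# With vLLM installed, these would typically be passed as
# vllm.LLM(**engine_kwargs) and vllm.SamplingParams(**sampling_kwargs).
```

Keeping the two groups apart mirrors how the engine is configured once while sampling options are supplied per request.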


Correctness and consistency
* Does the reasoning logically support the official answer {answer}, instead of drifting toward a different answer?
* Is it broadly consistent with the radiology report in terms of imaging findings and diagnostic direction?
* Does it avoid obvious medical errors or reasoning that conflicts with basic chest imaging knowledge?

Use of evidence and reasoning chain
* Does it clearly build on plausible key findings that one could see on the chest X-ray, rather than just restating the answer or giving vague comments?
* Does it show a reasonable reasoning flow, for example: “inspect the image → describe key abnormalities/locations/extent → relate to the question → rule out less likely options → reach the conclusion” ...

Information use and restraint
* Does it focus on the findings and information that are actually relevant to this specific question, instead of introducing large amounts of unrelated content?
* Does it avoid clearly inventing findings or test results that are not supported by the image or the case?
* If it implicitly aligns with ideas present in the radiol...

Clarity of expression
* Is the language clear and easy to understand?
* Does it clearly explain why the official answer makes sense, rather than merely restating the conclusion?
* Is it concise but effective, without losing focus through excessive expansion?

【Special instructions for comparison】
* This is a quality comparison, not a length comparison;
* L...

【Output requirements】

Internally compare reasoning A and B using the above criteria, and decide which one is overall better.

Then output your decision and justification in XML format, with root tag <response> and two child tags:
* <reason>: briefly explain how you compared A and B, in which aspects A is better or B is better, and why you finally chose one over the other.
* <result>: write only a single capital letter, “A” or “B”, indicating which reasoning you judge to be bette...
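The pairwise output format above lends itself to a small parser. The sketch below is one possible way to read the judge's verdict, not the authors' implementation; the function name is hypothetical. It uses a regular expression instead of a full XML parser so that unescaped characters inside the reason text do not cause a failure, and it returns `None` when the reply does not contain exactly one valid verdict.

```python
import re
from typing import Optional

# Matches the <result> tag described in the prompt; only "A" or "B" is valid.
_RESULT_RE = re.compile(r"<result>\s*([AB])\s*</result>")


def parse_pairwise_verdict(judge_output: str) -> Optional[str]:
    """Return "A" or "B", or None if the verdict is missing or ambiguous."""
    matches = _RESULT_RE.findall(judge_output)
    if len(matches) != 1:
        return None
    return matches[0]
```

Treating a missing or duplicated verdict as `None` lets the caller decide whether to discard or re-query that comparison.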

Chest X-ray report (Report)
- This is the factual basis for the case.
- If the reference answer or the student answer conflicts with the report, the report has the highest priority for judging medical correctness

Reference answer (Ground Truth)
- Provided by the instructor; it reflects the key information that the question is intended to test.
- You should extract the core medical elements from it (e.g. abnormal findings, diagnosis, location, cause, extent, severity, management advice)

Student answer (Pred)
- The wording may differ from the reference answer.
- As long as the medical meaning is equivalent or very close, and it does not contradict the report, it can be graded as correct.

【Grading principles】

You must decide a binary result (yes/no) based on the following:

Focus on what the question is asking for
- First, understand what type of information the question explicitly asks:
  - e.g. “most likely diagnosis”, “main abnormal finding”, “location of the lesion”, “possible cause”, “severity”, “management step”, etc.
- The reference answer shows what the instructor really wants the student to provide

Identify the core elements in the reference answer
- The reference answer may contain one or several key points:
  - If the question asks for “main diagnosis / main abnormality / most important change”, the student answer should at least cover that main core element; minor omissions may be acceptable.
  - If the question explicitly asks for “all major abnorma...

When to mark the student as correct (result = "yes")
- The student’s answer matches the core meaning of the reference answer:
  - Synonyms, paraphrases, and equivalent medical terminology are acceptable;
  - Shorter wording is acceptable if it still captures the key medical content.
- The student answer must NOT:
  - Present a diagnosis/finding/location/cause t...

When to mark the student as incorrect (result = "no")
- The student answer misses the key information required by the question;
- The main content of the student answer conflicts with the reference answer and/or the report;
- The answer is too general or ambiguous and does not demonstrate actual understanding of the required specific point;
- The answer c...

Leniency vs strictness
- Do not require exact wording; judge based on medical meaning.
- Be tolerant of minor phrasing differences that do not affect correctness.
- Be strict regarding the main diagnostic direction, key location, main abnormality type, key cause, and other crucial elements.

【Output format】

First complete your reasoning internally; do NOT output your intermediate thoughts

Then output exactly one XML code block, wrapped in ```xml, with the following structure: ```xml <response> <reason>Use 2–5 sentences in English to briefly explain why you judged the student answer as correct or incorrect. Mention what the question asks for, what the core elements of the reference answer are, and whether the student answer matches them or ...

Do NOT output anything outside this XML code block. No extra explanations, no additional code blocks.

---

【Chest X-ray report】 {report}

---

【Open-ended question】 {question}

---

【Reference (ground truth) answer】 {ground_truth}

---

【Student answer】 {pred}

---

Based on the above information and grading principles, decide whether the student’s answer should b...
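The grading prompt above has four placeholders ({report}, {question}, {ground_truth}, {pred}) and asks for a single fenced XML block. A minimal sketch of filling the placeholders and reading back the binary grade follows. The abbreviated template, the function names, and the assumption that the verdict sits in a `<result>` tag with value yes or no are illustrative and not confirmed by the truncated prompt text.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically so this example does not nest fences

# Abbreviated stand-in for the case-specific tail of the grading prompt.
CASE_TEMPLATE = (
    "【Chest X-ray report】 {report}\n"
    "【Open-ended question】 {question}\n"
    "【Reference (ground truth) answer】 {ground_truth}\n"
    "【Student answer】 {pred}\n"
)

_BLOCK_RE = re.compile(re.escape(FENCE) + r"xml\s*(.*?)" + re.escape(FENCE), re.DOTALL)
_RESULT_RE = re.compile(r"<result>\s*(yes|no)\s*</result>", re.IGNORECASE)


def build_case(report: str, question: str, ground_truth: str, pred: str) -> str:
    """Fill the four placeholders named in the prompt."""
    return CASE_TEMPLATE.format(
        report=report, question=question, ground_truth=ground_truth, pred=pred
    )


def parse_grade(judge_output: str) -> Optional[bool]:
    """Return True for yes, False for no, None if no valid fenced verdict exists."""
    block = _BLOCK_RE.search(judge_output)
    if block is None:
        return None
    result = _RESULT_RE.search(block.group(1))
    if result is None:
        return None
    return result.group(1).lower() == "yes"
```

Requiring the fenced block before looking for the verdict follows the prompt's instruction that nothing should appear outside that block.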

[67] [68]

Consistency with the official answer - Does the reasoning ultimately support the given official answer {answer} (at least not contradict it in meaning)? - Does it clearly explain why this answer is reasonable, rather than implicitly suggesting that some other answer would be more appropriate?

[68] [69]

Medical consistency with the case/report - Do the imaging findings, abnormalities, diagnostic tendencies, etc. mentioned in the reasoning broadly match the radiology report in terms of direction and key facts? - Does it avoid conclusions that are clearly opposite to the report or strongly violate basic chest imaging knowledge (e.g., treating obviously nor...

[69] [70]

inspect the image → describe key findings → analyze in light of the question → rule out less likely possibilities → arrive at a conclusion that matches the official answer

Completeness and plausibility of the reasoning chain - Does it present a genuine reasoning process, rather than just a bare conclusion or a single sentence explanation? - Does it cover most of the following elements: “inspect the image → describe key findings → analyze in light of the question → rule out less likely possibilities → arrive at a conclusion ...

[70] [71]

the report states…

Proper use of information - Does it mainly rely on findings that could reasonably be observed on the chest X-ray and on the information in the question, rather than inventing non-existent tests or imaging features? - Does it avoid obviously copying phrases from the radiology report, or explicitly revealing that it saw the report/answer (e.g., “the report ...

[71] [72]

Clarity of explanation - Is the language clear enough that someone with basic medical background can understand what the reasoning is saying? - Does it stay focused on what this particular question is asking, without long digressions into irrelevant content?

Language quality and readability - Is the text reasonably fluent, with basically correct grammar, and without so many typos, nonsense words, or scrambled word order that it becomes hard to understand? - Check whether the main language of the reasoning is English: - It may include some non-English technical terms or abbreviations, but the majority of the t...

First, complete your internal evaluation, then provide your judgment and justification in XML format

inspect the image(s) → describe key findings → analyze these findings in the context of the question → reach a conclusion

Use <response> as the root tag, with two child tags: - <reason>: briefly explain, in natural English, why you consider the reasoning acceptable or not acceptable. You may mention both strengths and major flaws. - <result>: write yes or no, indicating your final judgment on the reasoning trace. Important: - Do NOT output anything outside the XML structure....
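
The output format requested above (a `<response>` root with `<reason>` and `<result>` children, where `<result>` is yes or no) lends itself to strict parsing on the consumer side. The following is a minimal sketch of such a parser, not the authors' implementation: the names `Judgment` and `parse_judgment` are hypothetical, and the choices to tolerate stray text around the XML and to return `None` for malformed replies are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import NamedTuple, Optional


class Judgment(NamedTuple):
    """Hypothetical container for one parsed judge reply."""
    acceptable: bool
    reason: str


def parse_judgment(raw: str) -> Optional[Judgment]:
    """Parse <response><reason>...</reason><result>yes|no</result></response>.

    Returns None for malformed replies so the caller can decide
    whether to retry, skip, or count the sample as rejected.
    """
    # Assumption: judges sometimes add text outside the XML despite the
    # instruction, so only the first <response> element is extracted.
    match = re.search(r"<response>.*?</response>", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        root = ET.fromstring(match.group(0))
    except ET.ParseError:
        # Not well-formed XML (for example, an unescaped '&').
        return None
    reason = (root.findtext("reason") or "").strip()
    result = (root.findtext("result") or "").strip().lower()
    if result not in {"yes", "no"}:
        return None
    return Judgment(acceptable=(result == "yes"), reason=reason)
```

Returning `None` instead of defaulting to "no" keeps formatting failures of the judge separate from genuine rejections of the reasoning trace, which matters if the two cases are to be counted or retried differently.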

**Localized density**: The mass is homogenous and uniformly dense (appearing white) with no air bronchograms or cavitation—this rules out pneumonic (air-filled) or embolic (gas-filled) lesions

**Well-circumscribed contour**: The mass has a sharp, defined border against the adjacent lung parenchyma, a hallmark of primary malignancy rather than infectious or inflammatory processes

**No surrounding vascular or pleural changes**: There is no blunting of the costophrenic angle, no pleural effusion, no mediastinal shift, and no interstitial thickening—features absent in pneumonia (which typically shows bronchial air bronchograms and consolidation), pulmonary embolism (which shows vascular calcification or ground-glass opacities), or tu...

**Absence of cavitation or cavitation-like structure**: The mass is solid and non-cavitating—consistent with a primary neoplasm, not a tuberculous lesion that may evolve into cavitation. Now analyzing the clinical history (though not fully described in the prompt), the presence of a **right upper lobe mass** in a patient with no signs of systemic infectio...