arxiv: 2607.01440 · v1 · pith:OSS5ZEB6new · submitted 2026-07-01 · 💻 cs.CL

FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

Pith reviewed 2026-07-03 21:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords medical reasoning · LLM training · reinforcement learning · evidence-based medicine · faithful reasoning · process rewards · medical benchmarks

The pith

FaithMed trains medical LLMs by assigning rewards to each reasoning step against clinician rubrics for evidence use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current medical LLMs retrieve evidence but lack supervision on how to appraise and apply it during reasoning. FaithMed formalizes evidence-based medicine principles into process-level rubrics designed by clinicians and refined automatically. These rubrics drive reinforcement learning that assigns rewards at the step level with advantage grouping. On seven medical benchmarks the method raises task performance over agentic-search baselines by 9 percent on average and over outcome-only RL by 5.8 percent, while lifting average rubric scores by 15.5 percent. The work therefore claims that explicit step-level supervision improves both accuracy and the faithfulness of the generated reasoning.

Core claim

FaithMed improves average performance by 9 percent over agentic-search baselines and 5.8 percent over outcome-only RL across seven medical benchmarks, while increasing evidence-based medicine rubric scores by 15.5 percent over agentic-search Qwen3 baselines. The framework achieves this by combining clinician-designed rubrics with reinforcement learning that uses step-level process reward assignment and advantage grouping.

What carries the argument

Clinician-designed rubrics that encode process-level criteria for evidence appraisal and application, paired with step-level reward assignment inside reinforcement learning.
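
To make the idea of step-level reward assignment with advantage grouping concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not say how steps are grouped or normalized, so this assumes a GRPO-style normalization in which rubric rewards for steps of the same type, across rollouts of one question, form a group. The function name `step_advantages`, the `(step_type, reward)` layout, and the step-type labels are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def step_advantages(rollouts, eps=1e-6):
    """Normalize per-step rubric rewards within groups of same-type steps.

    rollouts: trajectories sampled for ONE question; each trajectory is a
    list of (step_type, rubric_reward) pairs. This layout is an assumption
    made for illustration, not something the source specifies.
    """
    # Collect rewards of same-type steps across all rollouts.
    groups = defaultdict(list)
    for trajectory in rollouts:
        for step_type, reward in trajectory:
            groups[step_type].append(reward)

    # Mean and population standard deviation per group.
    stats = {t: (mean(r), pstdev(r)) for t, r in groups.items()}

    # Advantage of a step = its reward standardized against its own group.
    return [
        [(reward - stats[t][0]) / (stats[t][1] + eps) for t, reward in trajectory]
        for trajectory in rollouts
    ]


if __name__ == "__main__":
    rollouts = [
        [("search", 1.0), ("appraise", 0.0)],
        [("search", 0.0), ("appraise", 0.0)],
    ]
    print(step_advantages(rollouts))
```

Under this assumption a step is credited only relative to comparable steps, so a well-appraised piece of evidence is not penalized because a later step in the same trajectory went wrong. An outcome-only scheme would instead give every step in a trajectory the same advantage.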

If this is right

  • Medical LLMs trained this way produce both higher task accuracy and more transparent, evidence-grounded justifications.
  • Step-level process rewards outperform training that rewards only final outcomes.
  • The same rubric-plus-step-reward approach can be applied to any benchmark where process faithfulness matters.
  • Explicit supervision of evidence use reduces reliance on ungrounded reasoning steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the rubrics capture core clinical standards, the same training loop could transfer to legal or scientific reasoning tasks that also require traceable evidence use.
  • Models trained under step-level rewards may show lower rates of unsupported claims when deployed in decision-support tools.
  • Testing the trained models on live clinical vignettes with outcome tracking would reveal whether higher rubric scores translate to better patient-level decisions.

Load-bearing premise

The clinician-designed rubrics, even after automatic refinement, provide a valid and unbiased measure of reasoning faithfulness that generalizes across the seven benchmarks.

What would settle it

An independent clinician panel rates reasoning traces on new cases and finds that rubric scores do not predict human judgments of faithfulness or that the reported performance gains vanish under the new ratings.
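
One way to run the check described above is to rank-correlate rubric scores with independent clinician ratings of the same reasoning traces. The sketch below computes a Spearman correlation in plain Python. The function names and the choice of Spearman are assumptions made for illustration, and any scores passed in would have to come from a real rating study.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(rubric_scores, clinician_ratings):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(rubric_scores), ranks(clinician_ratings))
```

A correlation near zero on new cases would support the objection that the rubrics do not track human judgments of faithfulness. A high correlation would support the premise, though it would not by itself show that the reported gains survive the new ratings.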

Figures

Figures reproduced from arXiv: 2607.01440 by Chenyan Xiong, Liwen Sun, Xiang Qian, Zhiyun Zhang.

Figure 1: Comparing unfaithful reasoning with evidence-based medical reasoning.
Figure 2: Method overview. We first design and refine rubrics for evaluating the reasoning process across five …
Figure 3: Rubric-score curves across seven medical benchmarks. Scores are averaged over the five rubric dimensions …
Figure 4: Diagnostics for reward assignment (RA) and …
Original abstract

Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it should be appraised and applied during reasoning. To address this, we formalize evidence-based medicine principles as process-level criteria and introduce FaithMed, a framework that combines clinician-designed, automatically refined rubrics with reinforcement learning using step-level process reward assignment and advantage grouping. Across seven medical benchmarks, FaithMed improves over agentic-search baselines (+9% on average) and outcome-only RL (+5.8%), while raising average evidence-based medicine rubric scores over agentic-search Qwen3 baselines (+15.5%). This work demonstrates that explicit step-level supervision can improve both task success and the faithfulness of the reasoning process. Code is available at https://github.com/cxcscmu/FaithMed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces FaithMed, a framework that formalizes evidence-based medicine (EBM) principles as process-level criteria and trains LLMs via reinforcement learning with step-level process rewards derived from clinician-designed, automatically refined rubrics. It reports average gains of +9% over agentic-search baselines and +5.8% over outcome-only RL across seven medical benchmarks, plus a +15.5% increase in average EBM rubric scores versus agentic-search Qwen3 baselines, claiming that explicit step-level supervision improves both task success and reasoning faithfulness. Code is released at https://github.com/cxcscmu/FaithMed.

Significance. If the faithfulness gains can be shown to hold under evaluation independent of the training rubrics, the work would provide concrete evidence that process-level supervision grounded in EBM principles can simultaneously boost benchmark performance and the transparency of medical reasoning in LLMs. The public code release is a clear strength that supports reproducibility.

major comments (1)
  1. [Abstract] Abstract: The reported +15.5% improvement in average EBM rubric scores is obtained using the same family of clinician-designed (automatically refined) rubrics that supply the step-level process rewards during RL training. Because the evaluation metric is the direct optimization target, the faithfulness gain is at least partly by construction; an independent held-out rubric, human clinician ratings on a separate scale, or cross-benchmark rubric transfer experiment is required to substantiate the central claim that step-level supervision improves faithfulness rather than merely fitting the training signal.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation of reasoning faithfulness. We address the major comment point-by-point below and will revise the manuscript to incorporate the suggested clarifications and additional validation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The reported +15.5% improvement in average EBM rubric scores is obtained using the same family of clinician-designed (automatically refined) rubrics that supply the step-level process rewards during RL training. Because the evaluation metric is the direct optimization target, the faithfulness gain is at least partly by construction; an independent held-out rubric, human clinician ratings on a separate scale, or cross-benchmark rubric transfer experiment is required to substantiate the central claim that step-level supervision improves faithfulness rather than merely fitting the training signal.

    Authors: We agree that the +15.5% gain on EBM rubric scores is measured with the same clinician-designed rubrics used to generate the process-level rewards, so this particular metric improvement is partly by construction. The task-performance gains (+9% over agentic-search baselines and +5.8% over outcome-only RL) remain independent of the rubrics and provide separate evidence that the process supervision yields more effective reasoning. In the revision we will explicitly note this limitation of the current faithfulness evaluation and add results from at least one independent protocol (e.g., held-out rubric transfer or human clinician ratings on a held-out sample of traces) to better substantiate the faithfulness claim. revision: yes

Circularity Check

1 step flagged

Rubric scores used for both RL training and evaluation risk non-independent measurement of faithfulness

specific steps
  1. fitted input called prediction [Abstract]
    "we formalize evidence-based medicine principles as process-level criteria and introduce FaithMed, a framework that combines clinician-designed, automatically refined rubrics with reinforcement learning using step-level process reward assignment and advantage grouping. Across seven medical benchmarks, FaithMed improves over agentic-search baselines (+9% on average) and outcome-only RL (+5.8%), while raising average evidence-based medicine rubric scores over agentic-search Qwen3 baselines (+15.5%)."

    The rubrics supply the process rewards that drive RL optimization; the identical rubric family is then used to compute the reported +15.5% EBM rubric score improvement. Because the evaluation metric is the training reward signal, the faithfulness gain is a direct consequence of successful optimization against that signal rather than an independent test.

full rationale

The paper's central empirical claim rests on two reported gains: task accuracy on external medical benchmarks (+9% avg) and EBM rubric scores (+15.5%). The latter is obtained by optimizing the policy directly against step-level rewards derived from the same clinician-designed, automatically refined rubrics that are later used to compute the evaluation scores. While the benchmark accuracy numbers are independent, the faithfulness improvement reduces to the training objective by construction and therefore cannot serve as external validation of the rubrics as an unbiased proxy. This matches the fitted-input-called-prediction pattern at the level of the reward signal rather than a mathematical equation.
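
The independence concern can be turned into a simple diagnostic: score traces on rubric dimensions that supplied training rewards and on dimensions withheld from training, then compare the gains. The sketch below assumes per-dimension average scores are already available. The dimension names and function names are hypothetical, and the numbers in the usage example are placeholders, not results from the paper.

```python
def mean_gain(baseline, trained, dims):
    """Average score improvement over the given rubric dimensions."""
    return sum(trained[d] - baseline[d] for d in dims) / len(dims)


def heldout_report(baseline, trained, train_dims, heldout_dims):
    """Compare gains on reward-supplying dimensions with held-out ones."""
    in_gain = mean_gain(baseline, trained, train_dims)
    out_gain = mean_gain(baseline, trained, heldout_dims)
    return {
        "in_reward_gain": in_gain,    # dimensions used as the RL reward
        "heldout_gain": out_gain,     # dimensions never seen in training
        "gap": in_gain - out_gain,    # large gap suggests fitting the signal
    }


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    baseline = {"dim_a": 0.5, "dim_b": 0.5, "dim_c": 0.5}
    trained = {"dim_a": 0.75, "dim_b": 0.75, "dim_c": 0.5}
    print(heldout_report(baseline, trained, ["dim_a", "dim_b"], ["dim_c"]))
```

A held-out gain close to the in-reward gain would indicate that the improvement generalizes beyond the optimized signal. A gain confined to the reward-supplying dimensions would match the by-construction reading given above.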

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified from the provided text.


