MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

Adrien Bazoge

arxiv: 2507.20917 · v1 · pith:ZGLSEDQWnew · submitted 2025-07-28 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 23:06 UTC · model grok-4.3

keywords: French medical question answering · language model evaluation · medical reasoning · reasoning vs recall · multilingual medical dataset · clinical scenarios · large language models

The pith

A new French medical question answering dataset reveals that large language models perform significantly better on factual recall than on reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates MediQAl, a collection of over 32,000 questions from French medical exams in 41 subjects. Questions are presented in multiple-choice or open-ended formats and labeled to test either basic understanding or deeper reasoning. When 14 large language models are tested on the dataset, they show markedly lower scores on reasoning tasks than on recall tasks. The work fills a gap by providing a benchmark for medical AI in French. This separation of tasks helps identify where models need improvement for practical medical use.

Core claim

MediQAl introduces a dataset of 32,603 questions from French medical examinations across 41 subjects, with tasks in unique-answer multiple choice, multiple-answer multiple choice, and short open-ended formats, each labeled Understanding or Reasoning, and evaluations of 14 large language models demonstrate a significant performance gap between factual recall and reasoning tasks.
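The gap described above is a difference between two per-label accuracies. A minimal sketch of that computation follows. The record fields (`task`, `label`, `correct`) and the toy rows are assumptions made for illustration; they are not the dataset's released schema or any reported result. The task codes MCQU, MCQM, and OEQ are the subset names used in the paper's figure captions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredQuestion:
    """One evaluated question (hypothetical schema, not the released format)."""
    task: str      # "MCQU", "MCQM", or "OEQ"
    label: str     # "Understanding" or "Reasoning"
    correct: bool  # whether the model's answer was judged correct


def accuracy_by_label(rows):
    """Return {label: accuracy} over an iterable of ScoredQuestion."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row.label] += 1
        hits[row.label] += int(row.correct)
    return {label: hits[label] / totals[label] for label in totals}


def recall_reasoning_gap(rows):
    """Understanding accuracy minus Reasoning accuracy."""
    acc = accuracy_by_label(rows)
    return acc["Understanding"] - acc["Reasoning"]


# Toy rows, invented for illustration only.
toy = [
    ScoredQuestion("MCQU", "Understanding", True),
    ScoredQuestion("MCQU", "Understanding", True),
    ScoredQuestion("MCQU", "Understanding", True),
    ScoredQuestion("MCQU", "Understanding", False),
    ScoredQuestion("MCQU", "Reasoning", True),
    ScoredQuestion("MCQU", "Reasoning", False),
    ScoredQuestion("MCQU", "Reasoning", False),
    ScoredQuestion("MCQU", "Reasoning", False),
]
print(accuracy_by_label(toy))
print(recall_reasoning_gap(toy))
```

In practice the same function would be applied per task and per model, since the three formats are scored differently and should not be pooled without care.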

What carries the argument

The labeling of questions as Understanding or Reasoning, which separates evaluation of factual medical knowledge from the ability to reason over clinical scenarios.

Load-bearing premise

Questions taken from French medical examinations accurately reflect real-world clinical scenarios and the Understanding/Reasoning labels assigned to each question are reliable and consistent.

What would settle it

A test showing no performance difference between Understanding and Reasoning tasks across models, or a mismatch between dataset performance and actual medical practice outcomes.
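One way such a test could be run is a percentile bootstrap on the accuracy difference for a single model. The sketch below is an assumption about how one might check this, not the paper's procedure; the function name, the resampling settings, and the toy correctness flags are all hypothetical.

```python
import random


def bootstrap_gap_ci(understanding, reasoning, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(understanding) - mean(reasoning).

    Both arguments are lists of 0/1 correctness flags for one model.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample each label group independently, with replacement.
        u = rng.choices(understanding, k=len(understanding))
        r = rng.choices(reasoning, k=len(reasoning))
        gaps.append(sum(u) / len(u) - sum(r) / len(r))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy flags, invented for illustration: 90% vs 60% accuracy on 500 items each.
gap_lower, gap_upper = bootstrap_gap_ci([1] * 450 + [0] * 50, [1] * 300 + [0] * 200)

# Toy flags with no true difference: 80% accuracy in both groups.
null_lower, null_upper = bootstrap_gap_ci([1] * 400 + [0] * 100, [1] * 400 + [0] * 100)

print(gap_lower, gap_upper)
print(null_lower, null_upper)
```

An interval that excludes zero for most of the 14 models would support the gap; intervals that straddle zero would be the kind of evidence that undermines it. This checks only the statistical difference, not whether the labels themselves are valid.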

Figures

Figures reproduced from arXiv: 2507.20917 by Adrien Bazoge.

**Figure 2.** Performance of three groups of models (OpenAI, DeepSeek and Llama) on all subsets of MediQAl.

**Figure 3.** Prompt for Medical Subjects Annotation in the MediQAl-OEQ subset. The list of medical subjects in the …

**Figure 4.** Prompt for labeling questions as Reasoning or Understanding.

**Figure 5.** Prompt for zero-shot evaluation of LLMs on the MCQU subset. The clinical scenario is optional in the …

**Figure 6.** Prompt for zero-shot evaluation of LLMs on the MCQM subset. The clinical scenario is optional in the …

**Figure 7.** Prompt for zero-shot evaluation of LLMs on the OEQ subset. The clinical scenario is optional in the …

**Figure 8.** Prompt for LLM-as-judge evaluation of LLMs on the OEQ subset.

Original abstract

This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MediQAl supplies a needed French medical QA dataset but the reasoning gap claim depends on labels without reported validation.

The main thing here is a new dataset of 32,603 French medical exam questions labeled as Understanding or Reasoning, formatted in three task styles, with benchmarks on 14 LLMs that show weaker results on the reasoning slice. It is the first resource of this scale for French medical QA. That directly tackles the shortage of non-English medical benchmarks, which matters for work on multilingual clinical models.

The three formats (single-answer multiple choice, multi-answer multiple choice, and open-ended short answers) give a practical way to compare model behavior across output types while staying inside the medical domain. Running recent reasoning-augmented models alongside standard ones and reporting the performance difference adds usable numbers for anyone tracking progress in this area. The evaluation is empirical and reported plainly, which fits what a dataset paper should do.

The soft spot is the labeling. The split into Understanding versus Reasoning drives the headline observation, yet the available description gives no inter-annotator agreement figures, annotation guidelines, or expert validation step. If the labels contain systematic noise, the gap could be an artifact of how questions were tagged rather than a clean signal about model capability. Exam questions are also stylized and may not track real clinical reasoning as closely as claimed.

This paper is mainly for researchers who need test sets in French or other lower-resource languages for medical NLP. It supplies a concrete starting point they can run or extend. It shows clear thinking in spotting the resource gap and setting up the tasks. I would bring it to a reading group to discuss the labeling method and how it lines up with English medical QA sets. It deserves peer review because new language-specific datasets like this can support real follow-on work once the annotation details are checked.

Referee Report

1 major / 0 minor

Summary. The paper introduces MediQAl, a dataset of 32,603 French medical examination questions spanning 41 subjects. Questions are partitioned into three task formats (unique-answer MCQ, multiple-answer MCQ, and open-ended short-answer) and each is labeled Understanding or Reasoning. Evaluation of 14 LLMs, including reasoning-augmented models, is used to demonstrate a significant performance gap between factual-recall and reasoning tasks, positioning the resource as a benchmark for multilingual medical QA.

Significance. A well-validated French medical QA dataset with cognitive labels would address a clear gap in non-English resources and allow targeted diagnosis of recall-versus-reasoning weaknesses in LLMs. The scale and multi-task design are positive; however, the headline gap observation is only as reliable as the Understanding/Reasoning partition.

major comments (1)

The manuscript does not report inter-annotator agreement, explicit annotation guidelines, or any validation study for the binary Understanding/Reasoning labels assigned to all 32,603 questions. Because the central empirical claim is a performance gap between these two categories, systematic mislabeling (e.g., difficult factual items placed in Reasoning) would directly produce or inflate the observed gap even if models treat both sets similarly.
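The mechanism in this comment can be shown with a small worked calculation. The sketch below assumes a model whose accuracy depends only on item difficulty, never on the Understanding/Reasoning category, and then skews the share of hard items between the two labels. All numbers (accuracies of 0.9 and 0.5, hard-item shares of 20% and 60%) are invented for illustration and do not come from the paper.

```python
def expected_accuracy(share_hard, acc_easy=0.9, acc_hard=0.5):
    """Expected accuracy of a bucket in which `share_hard` of the items are hard.

    The model is assumed to score `acc_easy` on easy items and `acc_hard`
    on hard items regardless of which label the item carries.
    """
    return (1 - share_hard) * acc_easy + share_hard * acc_hard


# Assumed labeling bias: hard items are 20% of "Understanding" but 60% of "Reasoning".
understanding_acc = expected_accuracy(0.2)
reasoning_acc = expected_accuracy(0.6)
apparent_gap = understanding_acc - reasoning_acc

print(round(understanding_acc, 2), round(reasoning_acc, 2), round(apparent_gap, 2))
```

Under these assumptions the apparent gap is the difference in hard-item share times the difference in per-difficulty accuracy, so a gap appears even though the model draws no distinction between recall and reasoning. A difficulty-matched comparison would be one way to separate the two effects.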

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback. We address the concern about the validation of the Understanding/Reasoning labels below.

Referee: The manuscript does not report inter-annotator agreement, explicit annotation guidelines, or any validation study for the binary Understanding/Reasoning labels assigned to all 32,603 questions. Because the central empirical claim is a performance gap between these two categories, systematic mislabeling (e.g., difficult factual items placed in Reasoning) would directly produce or inflate the observed gap even if models treat both sets similarly.

Authors: We agree that documenting the label assignment process is essential for supporting the central claim of a performance gap. The labels were produced by medical domain experts who classified each question according to whether it primarily requires factual recall (Understanding) or multi-step clinical reasoning and knowledge application (Reasoning). We acknowledge that the original manuscript omitted the annotation guidelines, inter-annotator agreement statistics, and any validation details. In the revised version we will add a dedicated subsection that (i) reproduces the annotation guidelines in full, (ii) reports Cohen’s kappa on a stratified sample of 1,000 questions independently labeled by two experts, and (iii) provides representative examples from each category. These additions will allow readers to evaluate the reliability of the partition directly.

revision: yes
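For readers unfamiliar with the statistic named in the rebuttal, Cohen's kappa is computed as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two annotators and p_e is the agreement expected by chance from their label frequencies. The sketch below is a plain-Python illustration; the function name and the ten toy labels are hypothetical and unrelated to any actual annotation of the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both annotators use one identical label;
        # this sketch returns 1.0 because agreement is perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration: U = Understanding, R = Reasoning.
expert_1 = ["U", "U", "U", "U", "U", "R", "R", "R", "R", "R"]
expert_2 = ["U", "U", "U", "U", "R", "R", "R", "R", "R", "U"]
kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 3))
```

Here the experts agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.5, so kappa is about 0.6. Raw agreement alone would overstate reliability when one label dominates, which is why a chance-corrected statistic is the usual choice.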

Circularity Check

0 steps flagged

No circularity: empirical dataset release with independent evaluation

The paper introduces MediQAl as a new dataset of 32,603 questions from French medical exams, assigns Understanding/Reasoning labels, and reports empirical performance gaps from evaluating 14 LLMs. No mathematical derivations, parameter fitting, predictions by construction, or self-citation chains are present. The central claim is a direct observational result from model runs on the labeled data, which does not reduce to any definitional or fitted input by construction. Label reliability is a validity concern but not a circularity issue under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new dataset constructed from existing French medical examinations without introducing fitted parameters, new axioms, or invented entities; all content is drawn from publicly available exam sources.
