MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
Pith reviewed 2026-05-21 23:06 UTC · model grok-4.3
The pith
A new French medical question answering dataset reveals that large language models perform significantly better on factual recall than on reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MediQAl is a dataset of 32,603 questions drawn from French medical examinations across 41 subjects. It covers three task formats (unique-answer multiple choice, multiple-answer multiple choice, and short open-ended), and each question is labeled Understanding or Reasoning. Evaluations of 14 large language models show a significant performance gap between factual recall and reasoning tasks.
What carries the argument
The labeling of questions as Understanding or Reasoning, which separates evaluation of factual medical knowledge from the ability to reason over clinical scenarios.
Load-bearing premise
Questions taken from French medical examinations accurately reflect real-world clinical scenarios, and the Understanding/Reasoning labels assigned to each question are reliable and consistent.
What would settle it
A test showing no performance difference between Understanding and Reasoning tasks across models, or a mismatch between dataset performance and actual medical practice outcomes.
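As a rough illustration of the quantity such a test would examine, the sketch below computes per-label accuracy and the Understanding minus Reasoning gap. The record format, the function names, and the toy results are all assumptions made for illustration; none of them come from the paper or its evaluation code.

```python
from collections import defaultdict


def accuracy_by_label(records):
    """Return {label: accuracy} for an iterable of (label, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for label, is_correct in records:
        totals[label] += 1
        hits[label] += int(is_correct)
    return {label: hits[label] / totals[label] for label in totals}


def recall_reasoning_gap(records):
    """Accuracy on Understanding items minus accuracy on Reasoning items."""
    acc = accuracy_by_label(records)
    return acc["Understanding"] - acc["Reasoning"]


# Hypothetical toy results for one model; not taken from the paper.
toy_records = [
    ("Understanding", True), ("Understanding", True),
    ("Understanding", True), ("Understanding", False),
    ("Reasoning", True), ("Reasoning", False),
    ("Reasoning", False), ("Reasoning", True),
]
toy_gap = recall_reasoning_gap(toy_records)  # 3/4 - 2/4
```

A gap near zero across all evaluated models would be the settling outcome described above; whether a nonzero gap is statistically significant would need a separate test that this sketch does not attempt.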
Original abstract
This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MediQAl, a dataset of 32,603 French medical examination questions spanning 41 subjects. Questions are partitioned into three task formats (unique-answer MCQ, multiple-answer MCQ, and open-ended short-answer) and each is labeled Understanding or Reasoning. Evaluation of 14 LLMs, including reasoning-augmented models, is used to demonstrate a significant performance gap between factual-recall and reasoning tasks, positioning the resource as a benchmark for multilingual medical QA.
Significance. A well-validated French medical QA dataset with cognitive labels would address a clear gap in non-English resources and allow targeted diagnosis of recall-versus-reasoning weaknesses in LLMs. The scale and multi-task design are positive; however, the headline gap observation is only as reliable as the Understanding/Reasoning partition.
major comments (1)
- The manuscript does not report inter-annotator agreement, explicit annotation guidelines, or any validation study for the binary Understanding/Reasoning labels assigned to all 32,603 questions. Because the central empirical claim is a performance gap between these two categories, systematic mislabeling (e.g., difficult factual items placed in Reasoning) would directly produce or inflate the observed gap even if models treat both sets similarly.
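The referee's mechanism can be made concrete with a small worked example. The sketch assumes a hypothetical model whose accuracy depends only on item difficulty (0.9 on easy items, 0.5 on hard ones) and not on whether an item is truly recall or reasoning. All counts and accuracies are invented for illustration.

```python
# Hypothetical model: accuracy depends on difficulty only, not on item type.
ACC_BY_DIFFICULTY = {"easy": 0.9, "hard": 0.5}


def expected_accuracy(counts):
    """Expected accuracy for a set of items given as {difficulty: count}."""
    total = sum(counts.values())
    correct = sum(n * ACC_BY_DIFFICULTY[d] for d, n in counts.items())
    return correct / total


def label_gap(partition):
    """Understanding accuracy minus Reasoning accuracy for a labeled partition."""
    return (expected_accuracy(partition["Understanding"])
            - expected_accuracy(partition["Reasoning"]))


# Labels independent of difficulty: both categories have the same mix.
clean = {
    "Understanding": {"easy": 50, "hard": 50},
    "Reasoning": {"easy": 50, "hard": 50},
}

# Systematic mislabeling: the 50 hard factual items are filed under Reasoning.
skewed = {
    "Understanding": {"easy": 50, "hard": 0},
    "Reasoning": {"easy": 50, "hard": 100},
}

clean_gap = label_gap(clean)    # 0.70 - 0.70
skewed_gap = label_gap(skewed)  # 0.90 - 95/150
```

Under these assumptions the same model shows no gap with difficulty-independent labels and a gap of roughly 27 points once hard factual items are moved into Reasoning, which is why the label validation requested here matters for the headline claim.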
Simulated Authors' Rebuttal
We thank the referee for their detailed and constructive feedback. We address the concern about the validation of the Understanding/Reasoning labels below.
- Referee: The manuscript does not report inter-annotator agreement, explicit annotation guidelines, or any validation study for the binary Understanding/Reasoning labels assigned to all 32,603 questions. Because the central empirical claim is a performance gap between these two categories, systematic mislabeling (e.g., difficult factual items placed in Reasoning) would directly produce or inflate the observed gap even if models treat both sets similarly.
Authors: We agree that documenting the label assignment process is essential for supporting the central claim of a performance gap. The labels were produced by medical domain experts who classified each question according to whether it primarily requires factual recall (Understanding) or multi-step clinical reasoning and knowledge application (Reasoning). We acknowledge that the original manuscript omitted the annotation guidelines, inter-annotator agreement statistics, and any validation details. In the revised version we will add a dedicated subsection that (i) reproduces the annotation guidelines in full, (ii) reports Cohen’s kappa on a stratified sample of 1,000 questions independently labeled by two experts, and (iii) provides representative examples from each category. These additions will allow readers to evaluate the reliability of the partition directly.
Revision: yes
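For readers unfamiliar with the agreement statistic promised in the rebuttal, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance from each annotator's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal pure-Python version; the function name and the ten toy labels are assumptions for illustration and do not reflect the authors' annotation data or tooling.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two experts on ten questions (U/R shorthand).
expert_1 = ["U", "U", "U", "R", "R", "R", "U", "R", "U", "R"]
expert_2 = ["U", "U", "R", "R", "R", "R", "U", "R", "U", "U"]
toy_kappa = cohens_kappa(expert_1, expert_2)  # p_o = 0.8, p_e = 0.5
```

In the toy data the experts agree on 8 of 10 items while chance agreement is 0.5, giving a kappa of about 0.6. A real report would also state how the 1,000-question sample was stratified, which the rebuttal leaves unspecified.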
Circularity Check
No circularity: empirical dataset release with independent evaluation
The paper introduces MediQAl as a new dataset of 32,603 questions from French medical exams, assigns Understanding/Reasoning labels, and reports empirical performance gaps from evaluating 14 LLMs. No mathematical derivations, parameter fitting, predictions by construction, or self-citation chains are present. The central claim is a direct observational result from model runs on the labeled data, which does not reduce to any definitional or fitted input by construction. Label reliability is a validity concern but not a circularity issue under the defined patterns.