pith. machine review for the scientific record.

arxiv: 2604.14306 · v2 · submitted 2026-04-15 · 💻 cs.CL · cs.AI

Recognition: unknown

EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

Alessandra Piscitelli, Alessandro Tosi, Alessia Longo, Antonio Cristiano, Bianca Destro Castaniti, Chiara Battipaglia, Federico Felizzi, Francesco Andrea Causio, Giulia Vojvodic, Lorenzo De Mori, Luigi De Angelis, Manuel Del Medico, Marcello Di Pumpo, Mariapia Vassalli, Melissa Sawaya, Michele Ferramola, Nicolò Scarsi, Olivia Riccomi, Pietro Eric Risuleo, Vittorio De Vita

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:59 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual medical QA · multimodal LLM evaluation · European medical exams · cross-lingual transfer · medical AI benchmark · contamination-resistant dataset · zero-shot prompting

The pith

EuropeMedQA creates the first multilingual, multimodal medical exam dataset from official regulatory tests in Italy, France, Spain, and Portugal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a protocol to build EuropeMedQA as a dataset of medical examination questions drawn directly from official sources in four European countries. It targets the known drop in large language model performance when moving from English medical tests to other languages or tasks that include images. The authors detail a curation process and automated translation steps that follow established guidelines for data quality and responsible AI research. A reader would care because this benchmark can measure how well models handle real European clinical content without relying on English-only data. If the approach holds, it would give developers a standard way to improve medical AI for non-English settings and visual diagnosis.

Core claim

EuropeMedQA is the first comprehensive multilingual and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. A rigorous curation process and automated translation pipeline aligned with FAIR data principles and SPIRIT-AI guidelines produce the questions. Contemporary multimodal LLMs are then evaluated in a zero-shot, strictly constrained prompting setup to measure cross-lingual transfer and visual reasoning, yielding a contamination-resistant benchmark that reflects the complexity of European clinical practices.
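
The protocol does not publish its exact prompt template; the minimal sketch below only illustrates what a zero-shot, strictly constrained multiple-choice prompt could look like. The wording and the four-option format are assumptions, not the authors' template.

```python
# Hypothetical sketch of a zero-shot, strictly constrained prompt.
# The paper does not publish its template; this only illustrates the
# idea of forcing a bare single-letter answer with no reasoning text.

PROMPT_TEMPLATE = (
    "You are taking an official medical licensing examination.\n"
    "Answer the question below. Respond with exactly one letter "
    "(A, B, C, or D) and nothing else.\n\n"
    "Question: {question}\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}\n"
    "Answer:"
)

def build_prompt(item: dict) -> str:
    """Render one exam item into the constrained zero-shot prompt."""
    return PROMPT_TEMPLATE.format(
        question=item["question"],
        a=item["options"][0],
        b=item["options"][1],
        c=item["options"][2],
        d=item["options"][3],
    )
```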

What carries the argument

The EuropeMedQA dataset, built from official exams via curation and automated translation, supplies the mechanism for consistent zero-shot testing of multimodal LLMs on cross-lingual medical content.

If this is right

  • LLM performance can be compared directly across languages on matched medical questions.
  • Visual diagnostic reasoning can be isolated and measured using the image-based items.
  • Medical AI development gains a concrete standard for generalizability beyond English data.
  • The protocol demonstrates how to apply FAIR and SPIRIT-AI rules to medical evaluation datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar curation pipelines could be applied to create benchmarks for other medical specialties or additional languages.
  • The dataset may expose language-specific gaps in how models handle regional medical terminology.
  • Pairing EuropeMedQA results with real hospital case data could test whether exam performance predicts practical clinical utility.

Load-bearing premise

The official exam questions and their translations will retain the full complexity and accuracy of real European clinical practice without introducing material errors or biases.

What would settle it

Either a side-by-side test showing that leading multimodal LLMs match their English-benchmark accuracy on EuropeMedQA with no performance drop, or direct evidence of material translation errors in medical terminology, would undermine the case for this specific new benchmark.

Figures

Figures reproduced from arXiv: 2604.14306 by Alessandra Piscitelli, Alessandro Tosi, Alessia Longo, Antonio Cristiano, Bianca Destro Castaniti, Chiara Battipaglia, Federico Felizzi, Francesco Andrea Causio, Giulia Vojvodic, Lorenzo De Mori, Luigi De Angelis, Manuel Del Medico, Marcello Di Pumpo, Mariapia Vassalli, Melissa Sawaya, Michele Ferramola, Nicolò Scarsi, Olivia Riccomi, Pietro Eric Risuleo, Vittorio De Vita.

Figure 1. Study flow diagram. After dataset processing, three parallel tracks are produced: an … [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]
read the original abstract

While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a study protocol for EuropeMedQA, described as the first comprehensive multilingual and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. It outlines adherence to FAIR principles and SPIRIT-AI guidelines, a rigorous curation process, an automated translation pipeline for cross-lingual data, and a planned zero-shot evaluation of contemporary multimodal LLMs to assess cross-lingual transfer and visual reasoning, with the goal of creating a contamination-resistant benchmark reflecting European clinical practices.

Significance. If the protocol is executed with sufficient rigor to produce high-fidelity data, EuropeMedQA would address a clear gap in non-English and multimodal medical benchmarks for LLMs, enabling more robust evaluation of cross-lingual capabilities and visual reasoning in clinical contexts. The explicit commitment to FAIR data principles and SPIRIT-AI guidelines is a strength that supports potential reusability and transparency.

major comments (3)
  1. [Abstract and translation pipeline description] Abstract and the section describing the automated translation pipeline: no specific machine translation system, post-editing protocol, or quantitative/human validation metrics (e.g., terminology accuracy on medical terms) are provided. This is load-bearing for the central claim that the resulting dataset will preserve diagnostic nuance and accurately reflect source exam content across languages.
  2. [Curation process and contamination resistance] The section on curation and contamination resistance: concrete methods for detecting and preventing data contamination (e.g., temporal cutoffs, overlap checks against common LLM training corpora, or source verification) are not specified, undermining the claim that EuropeMedQA will serve as a reliable contamination-resistant benchmark.
  3. [Evaluation strategy] The evaluation plan section: the 'strictly constrained prompting strategy' lacks detail on the exact constraints, language-specific adaptations, or how visual reasoning will be isolated from textual cues, which is necessary to support claims about assessing cross-lingual transfer and multimodal performance.
minor comments (2)
  1. [Abstract] The abstract claims 'the first comprehensive' dataset; consider qualifying this with respect to prior multilingual medical QA resources to avoid overstatement.
  2. [Data availability] Ensure all planned data releases include explicit licensing and access details to fully align with the stated FAIR principles.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our study protocol manuscript for EuropeMedQA. The comments identify important areas where additional specificity will strengthen the description of our planned methods. We address each major comment below and commit to revisions that incorporate the requested details without altering the core protocol design.

read point-by-point responses
  1. Referee: [Abstract and translation pipeline description] Abstract and the section describing the automated translation pipeline: no specific machine translation system, post-editing protocol, or quantitative/human validation metrics (e.g., terminology accuracy on medical terms) are provided. This is load-bearing for the central claim that the resulting dataset will preserve diagnostic nuance and accurately reflect source exam content across languages.

    Authors: We agree that the translation pipeline description in the current protocol is high-level and lacks the requested implementation specifics. As this is a study protocol outlining planned work rather than completed execution, we will revise the relevant section (and update the abstract accordingly) to specify the intended machine translation system (a hybrid of domain-adapted open-source models and commercial APIs with medical terminology fine-tuning), a post-editing protocol involving native-speaking medical professionals for each target language, and quantitative validation metrics including terminology accuracy via expert review, back-translation consistency checks, and human evaluation scores for diagnostic nuance preservation. These additions will directly support the claims about cross-lingual fidelity (a sketch of the back-translation check follows this list). revision: yes

  2. Referee: [Curation process and contamination resistance] The section on curation and contamination resistance: concrete methods for detecting and preventing data contamination (e.g., temporal cutoffs, overlap checks against common LLM training corpora, or source verification) are not specified, undermining the claim that EuropeMedQA will serve as a reliable contamination-resistant benchmark.

    Authors: The referee correctly notes that concrete contamination-resistance methods were not detailed in the protocol. We will revise the curation section to explicitly describe our planned safeguards: temporal cutoffs restricting inclusion to exams released after 2023 (post-dating major LLM training cutoffs), systematic n-gram and embedding-based overlap checks against known public training corpora (e.g., via tools like those used in prior benchmark papers), and direct source verification through official regulatory body documentation and metadata. These additions will substantiate the contamination-resistant benchmark claim (an overlap-check sketch follows this list). revision: yes

  3. Referee: [Evaluation strategy] The evaluation plan section: the 'strictly constrained prompting strategy' lacks detail on the exact constraints, language-specific adaptations, or how visual reasoning will be isolated from textual cues, which is necessary to support claims about assessing cross-lingual transfer and multimodal performance.

    Authors: We acknowledge that the evaluation plan requires more granular description to be fully convincing. In the revised manuscript, we will expand this section to define the exact constraints (fixed zero-shot templates with no chain-of-thought, strict output format enforcement via regex validation, and prohibition on external tool use), language-specific adaptations (native-language prompts and instructions for each of the four languages), and isolation of visual reasoning (systematic ablations comparing image+text vs. text-only inputs, plus controlled perturbations of visual elements). These details will better ground the claims on cross-lingual transfer and multimodal capabilities (a sketch of the output-format validation follows this list). revision: yes
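
To make the back-translation consistency check from response 1 concrete: a minimal sketch, assuming a pluggable `translate` function stands in for the (unspecified) MT system. The similarity scoring via Python's standard-library difflib, and any flagging threshold, are illustrative choices, not part of the protocol.

```python
import difflib

# Hypothetical sketch of a back-translation consistency check.
# `translate` is a stand-in for whichever MT system the protocol
# adopts; it is not a real API.

def translate(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("plug in the chosen MT system here")

def back_translation_score(original: str, src_lang: str, pivot_lang: str) -> float:
    """Translate src -> pivot -> src and score surface similarity.

    Low scores flag items for expert post-editing; the threshold is a
    free choice, not something the protocol specifies.
    """
    forward = translate(original, src=src_lang, tgt=pivot_lang)
    back = translate(forward, src=pivot_lang, tgt=src_lang)
    return difflib.SequenceMatcher(None, original, back).ratio()
```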
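The n-gram overlap check from response 2 could look like the sketch below. The 13-gram window and 0.2 flagging threshold are assumptions for illustration, not values the protocol specifies, and `corpus_ngrams` would have to be built from an index of public training corpora.

```python
# Hypothetical sketch of an n-gram overlap contamination check.
# Window size (13) and threshold (0.2) are illustrative only.

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """All whitespace-tokenized n-grams of a question, lowercased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(question: str, corpus_ngrams: set[tuple[str, ...]], n: int = 13) -> float:
    """Fraction of a question's n-grams found in a reference corpus."""
    q = ngrams(question, n)
    if not q:
        return 0.0
    return len(q & corpus_ngrams) / len(q)

def is_contaminated(question: str, corpus_ngrams: set[tuple[str, ...]], threshold: float = 0.2) -> bool:
    """Flag a question whose overlap rate meets the chosen threshold."""
    return contamination_rate(question, corpus_ngrams) >= threshold
```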
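The regex-based output-format enforcement from response 3 might be as simple as the following sketch. The single-letter A–D pattern is an assumed answer format, since the protocol does not publish its templates; outputs that violate it would be scored as format violations rather than guesses.

```python
import re

# Hypothetical sketch of strict output-format enforcement: accept only
# a bare option letter, reject anything that smuggles in reasoning
# text. The pattern is an assumption about the answer format.

ANSWER_RE = re.compile(r"^\s*([A-D])\s*$")

def parse_constrained_answer(model_output: str) -> str | None:
    """Return the option letter if the output obeys the constraint,
    otherwise None (a format violation, not a guess)."""
    match = ANSWER_RE.match(model_output)
    return match.group(1) if match else None
```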

Circularity Check

0 steps flagged

No circularity: descriptive dataset protocol with no derivations or self-referential reductions

full rationale

This is a study protocol paper outlining dataset construction, curation, translation pipeline, and evaluation strategy for EuropeMedQA. It contains no equations, fitted parameters, predictions of derived quantities, or mathematical derivations. The central claims concern data sourcing from official exams and application of FAIR/SPIRIT-AI principles; these are procedural descriptions, not self-defining or fitted-input reductions. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The paper is self-contained as a methods description without any step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a study protocol for dataset creation with no mathematical derivations, fitted parameters, or postulated entities. The contribution rests on the planned data collection and translation methods rather than any axiomatic or parametric structure.

pith-pipeline@v0.9.0 · 5514 in / 1052 out tokens · 37199 ms · 2026-05-10T13:59:21.200909+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering

    Alonso I, Oronoz M, and Agerri R. MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering. Artificial Intelligence in Medicine 2024 Sep; 155:102938. doi:10.1016/j.artmed.2024.102938

  2. [2]

    Benchmarking large language models on answering and explaining challenging medical questions

    Chen H et al. Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions. 2024. doi:10.48550/ARXIV.2402.18060

  3. [3]

    Evaluating LLMs in Medicine: A Call for Rigor, Transparency

    Alwakeel M et al. Evaluating LLMs in Medicine: A Call for Rigor, Transparency. 2025. doi:10.48550/ARXIV.2507.08916

  4. [4]

    Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering

    Felizzi F et al. Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering. 2025. doi:10.48550/ARXIV.2511.19220

  5. [5]

    Polish-English Medical Knowledge Transfer: A New Benchmark and Results

    Grzybowski L et al. Polish-English Medical Knowledge Transfer: A New Benchmark and Results. 2024. doi:10.48550/ARXIV.2412.00559

  6. [6]

    Evaluation of the Performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination

    Rosol M et al. Evaluation of the Performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Scientific Reports 2023 Nov; 13. doi:10.1038/s41598-023-46995-z

  7. [7]

    Valutazione One-Shot di Mistral7B sul Nuovo Benchmark EuropeMedQA

    Riccomi O, Causio FA, De Vita V, et al. Valutazione One-Shot di Mistral7B sul Nuovo Benchmark EuropeMedQA. Recenti Progressi in Medicina 2025 Oct; 116. doi:10.1701/4573.45804

  8. [8]

    A General Language Assistant as a Laboratory for Alignment

    Askell A, Bai Y, Chen A, et al. A General Language Assistant as a Laboratory for Alignment. 2021. doi:10.48550/ARXIV.2112.00861

  9. [9]

    RealMedQA: A Pilot Biomedical Question Answering Dataset Containing Realistic Clinical Questions

    Kell G, Roberts A, Umansky S, et al. RealMedQA: A Pilot Biomedical Question Answering Dataset Containing Realistic Clinical Questions. 2024. doi:10.48550/ARXIV.2408.08624

  10. [10]

    MulMed: Addressing Multiple Medical Tasks Utilizing Large Language Models

    Cheng N, Li F, and Huang L. MulMed: Addressing Multiple Medical Tasks Utilizing Large Language Models. 2024 Oct. doi:10.21203/rs.3.rs-4967279/v1

  11. [11]

    A History of Artificial Intelligence

    Grzybowski A, Pawlikowska-Lagod K, and Lambert WC. A History of Artificial Intelligence. Clinics in Dermatology 2024 May; 42:221–9. doi:10.1016/j.clindermatol.2023.12.016

  12. [12]

    A Practical Guide to FAIR Data Management in the Age of Multi-OMICS and AI

    Mugahid D, Lyon J, Demurjian C, et al. A Practical Guide to FAIR Data Management in the Age of Multi-OMICS and AI. Frontiers in Immunology 2025 Jan; 15. doi:10.3389/fimmu.2024.1439434

  13. [13]

    Guidelines for Clinical Trial Protocols for Interventions Involving Artificial Intelligence: The SPIRIT-AI Extension

    Cruz Rivera S, Liu X, Chan AW, et al. Guidelines for Clinical Trial Protocols for Interventions Involving Artificial Intelligence: The SPIRIT-AI Extension. Nature Medicine 2020 Sep; 26:1351–63. doi:10.1038/s41591-020-1037-7

  14. [14]

    The STARD-AI Reporting Guideline for Diagnostic Accuracy Studies Using Artificial Intelligence

    Sounderajah V, Guni A, Liu X, et al. The STARD-AI Reporting Guideline for Diagnostic Accuracy Studies Using Artificial Intelligence. Nature Medicine 2025 Sep; 31:3283–9. doi:10.1038/s41591-025-03953-8