pith. sign in

arxiv: 2605.16215 · v2 · pith:RGJBW7T4new · submitted 2026-05-15 · 💻 cs.AI · cs.CL

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Pith reviewed 2026-05-20 18:48 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords clinical LLMsfully open modelsauditable AI pipelinesmedical benchmarkssynthetic clinical dataclinician validationLLM evaluation
0
0 comments X

The pith

Fully open pipelines for clinical LLMs achieve state-of-the-art performance while exposing every step for audit and reproduction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first complete open pipeline for training clinical large language models. It starts with public medical question-answering datasets, adds synthetic questions and vignettes reviewed by clinicians, removes contaminated examples, and tests the resulting models with a judge protocol calibrated to human doctors. A reader would care because most medical AI tools hide their training data and processes, which makes independent verification difficult in high-stakes settings. This work shows that transparency does not have to come at the cost of capability.

Core claim

We introduce Fully Open Meditron as an end-to-end auditable pipeline that normalizes eight public medical QA datasets into conversational format, augments them with clinician-vetted synthetic extensions from 46,469 clinical practice guidelines and vignettes, applies system-wide decontamination and gold-label resampling, and validates outputs with a four-physician panel and an LLM-as-a-judge protocol calibrated against 204 human raters. Applying this recipe to open base models yields variants that are preferred over their bases and, in some cases, over existing closed medical models on benchmarks and vignette comparisons.

What carries the argument

The Fully Open Meditron pipeline, which combines data unification, clinician auditing of synthetic extensions, decontamination, and use-aligned evaluation to produce reproducible clinical LLMs.

If this is right

  • Open-weight models gain substantial medical capability when trained on this audited corpus.
  • The pipeline works across different base model sizes and families.
  • Evaluation can rely on calibrated LLM judges rather than always needing full human review.
  • Clinical decision support systems can be built with complete data provenance and reproducibility.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar pipelines might apply to other high-stakes domains where transparency matters as much as accuracy.
  • Reducing dependence on proprietary data could accelerate development of specialized models in medicine.
  • Real-world deployment tests would reveal whether benchmark gains translate to better patient outcomes.

Load-bearing premise

The clinician-created synthetic questions and vignettes accurately represent real clinical situations without adding systematic errors or biases that affect model decisions.

What would settle it

Running the open models on a held-out set of actual anonymized patient records and finding higher rates of incorrect or unsafe advice compared to closed models.

Figures

Figures reproduced from arXiv: 2605.16215 by David Sasu, Fay Elhassan, Lars Klein, Mary-Anne Hartley, Mushtaha El-Amin, Sahaj Vaidya, Victor Cartier-Negadi, Xavier Theimer-Lienhard.

Figure 1
Figure 1. Figure 1: Evolution of medical LLM performance on Healthbench over time across closed-data, open [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The Fully Open Meditron Corpus construction pipeline. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of Fully Open Meditron datasets in records count. 5 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Auto-MOOVE pairwise preference results. For each prompt drawn from the MOOVE evaluation split, two model responses are evaluated by Qwen3-235B-A22B which assigns a winner (Model 1, Model 2, or Tie). Bars show the share of prompts on which each model wins, ties, or loses (N = 12,602 comparisons per pair). Judge agreement with a 204-rater human panel was validated prior to use; see App. H. (Left: Each Fully … view at source ↗
Figure 5
Figure 5. Figure 5: Per-criterion Auto-MOOVE Likert profiles for Fully Open Meditron models versus corre [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Synthetic MOOVE vs. source (nsrc = 24,679, nsyn = 24,465). Top specialties preserved in rank; difficulty shifts toward levels 4–5. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Guidelines QA vs. source (nsrc = 16,300, nsyn = 145,681, a ∼9× amplification). Difficulty is not comparable for this component, since the source consists of clinical practice guidelines rather than question– answer pairs. Both annotated axes closely match the source (JSD ≤ 0.014). Unspecified Infectious disease Neurology Gastroenterology Endocrinology Pediatrics Obstetrics General medicine Cardiology Ophth… view at source ↗
Figure 8
Figure 8. Figure 8: Synthetic Curated QA vs. source (nsrc = 211,244, nsyn = 214,654). The generator broadens coverage from the eight aggregated source datasets, promoting under-represented specialties; difficulty shift is 2.81 → 3.55. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of per-rater κ values across the 204-rater human panel, with the Auto-MOOVE judge’s κ situated within it. The judge falls within ±2σ of the human mean under both with-ties and no-ties scoring, indicating it is statistically indistinguishable from a typical human rater on this validation set. I Training details I.1 Infrastructure and framework All Fully Open Meditron models were trained on a hi… view at source ↗
Figure 10
Figure 10. Figure 10: Medical LLM Openness Tiers 30 [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
read the original abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Fully Open Meditron as the first fully open pipeline for clinical LLMs. It comprises a clinician-audited corpus that unifies eight public medical QA datasets into conversational format and augments them with three synthetic extensions (exam-style QA, guideline-grounded QA from 46,469 clinical practice guidelines, and clinical vignettes), a reproducible data-construction and training framework with system-wide decontamination and gold-label resampling, and a use-aligned evaluation protocol using LLM-as-a-judge on expert-written vignettes calibrated against 204 human raters. The recipe is applied to five fully open base models; reported results include a +6.6-point aggregate-benchmark gain for Apertus-70B-MeditronFO and a 58.6% preference rate for Gemma-3-27B-MeditronFO over MedGemma (with 58% vs 55.9% on HealthBench). The central claim is that fully open pipelines can reach domain-specific state-of-the-art performance while preserving auditability and reproducibility.

Significance. If the performance gains are attributable to genuine capability rather than training-distribution artifacts, the work is significant for establishing a concrete, end-to-end auditable recipe that closes the gap between open-weight and fully open models in medicine. Strengths include the explicit clinician auditing by a four-physician panel, the unification of public datasets with guideline-derived synthetic data, and the emphasis on decontamination and reproducible evaluation. These elements directly address the opacity problem in current LLM-based clinical decision support and provide a template that other groups can replicate or extend.

major comments (2)
  1. [Data Construction] Data Construction section: The central claim that the pipeline achieves genuine domain-specific gains rests on the assumption that the clinician-vetted synthetic extensions (especially the 46,469 guideline-grounded QA items and vignettes) faithfully represent real-world clinical distributions. The manuscript describes four-physician auditing and decontamination but provides no quantitative comparison (e.g., Kolmogorov-Smirnov tests or comorbidity-frequency tables) of the generated data against real clinical query logs or EHR statistics. Without such validation, the reported +6.6-point benchmark improvement and 58.6% preference rate could partly reflect distributional overlap with the evaluation vignettes rather than improved generalization.
  2. [Evaluation Protocol] Evaluation Protocol section: The LLM-as-a-judge protocol is calibrated on 204 human raters and applied to expert-written clinical vignettes, yet the training corpus contains similar vignette-style and guideline-derived synthetic data. The manuscript does not report a hold-out test on prospective clinical outcomes or external real-world logs, leaving open the possibility that the 58% vs 55.9% HealthBench result and overall preference scores are inflated by shared generative processes. A concrete external validation set would be required to support the claim that the gains reflect true capability rather than evaluation calibration.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'system-wide decontamination' is used without a brief parenthetical description of the exact procedure or exclusion criteria; adding one sentence would improve immediate clarity for readers.
  2. [Methods] Notation: The manuscript refers to 'FO base models' and 'MeditronFO variants' without an explicit glossary or table defining the five base models and their corresponding fine-tuned names; a small nomenclature table would reduce ambiguity.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, providing honest responses based on the scope and constraints of our work. Where feasible, we have revised the manuscript to incorporate clarifications and additional discussion.

read point-by-point responses
  1. Referee: [Data Construction] The manuscript describes four-physician auditing and decontamination but provides no quantitative comparison (e.g., Kolmogorov-Smirnov tests or comorbidity-frequency tables) of the generated data against real clinical query logs or EHR statistics. Without such validation, the reported +6.6-point benchmark improvement and 58.6% preference rate could partly reflect distributional overlap with the evaluation vignettes rather than improved generalization.

    Authors: We agree that quantitative distributional comparisons to real-world clinical logs would provide stronger evidence of representativeness. However, such logs and EHR statistics are not publicly available due to privacy regulations, preventing direct statistical tests like Kolmogorov-Smirnov or comorbidity tables from external sources. Our validation instead centers on systematic review by a four-physician panel, as described in the manuscript, combined with system-wide decontamination. We have added a dedicated limitations paragraph in the revised Data Construction section that discusses the representativeness of guideline-derived data, reports coverage statistics from the 46,469 guidelines, and notes the distinction between training distributions and evaluation benchmarks. Performance gains on multiple held-out medical benchmarks support generalization beyond any potential overlap. revision: partial

  2. Referee: [Evaluation Protocol] The LLM-as-a-judge protocol is calibrated on 204 human raters and applied to expert-written clinical vignettes, yet the training corpus contains similar vignette-style and guideline-derived synthetic data. The manuscript does not report a hold-out test on prospective clinical outcomes or external real-world logs, leaving open the possibility that the 58% vs 55.9% HealthBench result and overall preference scores are inflated by shared generative processes. A concrete external validation set would be required to support the claim that the gains reflect true capability rather than evaluation calibration.

    Authors: We appreciate the concern regarding potential calibration effects. HealthBench is an independent, externally developed benchmark that was not generated by our pipeline or synthetic processes. The LLM-as-a-judge protocol was explicitly calibrated against 204 human raters to align with expert clinical judgment, and we report results on both this protocol and standard aggregate medical benchmarks. We have revised the Evaluation Protocol section to more explicitly delineate the scope of our claims, clarify that results reflect benchmark performance rather than live clinical deployment, and state that prospective outcome validation lies outside the current computational study. We maintain that the combination of decontamination, distinct evaluation vignettes, and human calibration supports the reported gains as reflecting improved capability. revision: yes

standing simulated objections not resolved
  • Quantitative distributional analysis against real clinical query logs or EHR data, as these are not publicly accessible due to privacy regulations.
  • Prospective validation on real-world clinical outcomes, which would require IRB approval, live deployment, and access to patient data beyond the scope of this paper.

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmarks

full rationale

The paper's central claims rest on measured performance improvements (+6.6 points on aggregate medical benchmarks, 58% vs 55.9% on HealthBench) obtained by applying a data-construction pipeline to five base models and evaluating the resulting models against independent public QA datasets and a human-calibrated LLM-as-a-judge protocol. No derivation, equation, or first-principles step reduces to its own fitted inputs or self-citations; the synthetic extensions are generated from external guidelines, decontaminated, and then tested on separate vignettes and benchmarks whose distributions are not defined by the pipeline itself. The evaluation protocol is calibrated on 204 external human raters rather than on the training data, rendering the reported preference rates falsifiable outside the construction process.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on domain assumptions about data quality rather than new mathematical constructs or fitted parameters.

axioms (2)
  • domain assumption Clinician audits and vetting of synthetic data from guidelines produce high-fidelity, unbiased training examples representative of clinical practice.
    Invoked in the description of the three clinician-vetted synthetic extensions and the four-physician panel validation.
  • domain assumption LLM-as-a-judge scores calibrated on 204 human raters provide a reliable proxy for clinical quality on expert-written vignettes.
    Central to the use-aligned evaluation protocol.

pith-pipeline@v0.9.0 · 5922 in / 1473 out tokens · 52539 ms · 2026-05-20T18:48:44.861286+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.