Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
Pith reviewed 2026-05-20 18:48 UTC · model grok-4.3
The pith
Fully open pipelines for clinical LLMs achieve state-of-the-art performance while exposing every step for audit and reproduction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Fully Open Meditron as an end-to-end auditable pipeline that normalizes eight public medical QA datasets into conversational format, augments them with clinician-vetted synthetic extensions from 46,469 clinical practice guidelines and vignettes, applies system-wide decontamination and gold-label resampling, and validates outputs with a four-physician panel and an LLM-as-a-judge protocol calibrated against 204 human raters. Applying this recipe to open base models yields variants that are preferred over their bases and, in some cases, over existing closed medical models on benchmarks and vignette comparisons.
What carries the argument
The Fully Open Meditron pipeline, which combines data unification, clinician auditing of synthetic extensions, decontamination, and use-aligned evaluation to produce reproducible clinical LLMs.
If this is right
- Open-weight models gain substantial medical capability when trained on this audited corpus.
- The pipeline works across different base model sizes and families.
- Evaluation can rely on calibrated LLM judges rather than always needing full human review.
- Clinical decision support systems can be built with complete data provenance and reproducibility.
Where Pith is reading between the lines
- Similar pipelines might apply to other high-stakes domains where transparency matters as much as accuracy.
- Reducing dependence on proprietary data could accelerate development of specialized models in medicine.
- Real-world deployment tests would reveal whether benchmark gains translate to better patient outcomes.
Load-bearing premise
The clinician-created synthetic questions and vignettes accurately represent real clinical situations without adding systematic errors or biases that affect model decisions.
What would settle it
Running the open models on a held-out set of actual anonymized patient records and finding higher rates of incorrect or unsafe advice compared to closed models.
Figures
read the original abstract
Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Fully Open Meditron as the first fully open pipeline for clinical LLMs. It comprises a clinician-audited corpus that unifies eight public medical QA datasets into conversational format and augments them with three synthetic extensions (exam-style QA, guideline-grounded QA from 46,469 clinical practice guidelines, and clinical vignettes), a reproducible data-construction and training framework with system-wide decontamination and gold-label resampling, and a use-aligned evaluation protocol using LLM-as-a-judge on expert-written vignettes calibrated against 204 human raters. The recipe is applied to five fully open base models; reported results include a +6.6-point aggregate-benchmark gain for Apertus-70B-MeditronFO and a 58.6% preference rate for Gemma-3-27B-MeditronFO over MedGemma (with 58% vs 55.9% on HealthBench). The central claim is that fully open pipelines can reach domain-specific state-of-the-art performance while preserving auditability and reproducibility.
Significance. If the performance gains are attributable to genuine capability rather than training-distribution artifacts, the work is significant for establishing a concrete, end-to-end auditable recipe that closes the gap between open-weight and fully open models in medicine. Strengths include the explicit clinician auditing by a four-physician panel, the unification of public datasets with guideline-derived synthetic data, and the emphasis on decontamination and reproducible evaluation. These elements directly address the opacity problem in current LLM-based clinical decision support and provide a template that other groups can replicate or extend.
major comments (2)
- [Data Construction] Data Construction section: The central claim that the pipeline achieves genuine domain-specific gains rests on the assumption that the clinician-vetted synthetic extensions (especially the 46,469 guideline-grounded QA items and vignettes) faithfully represent real-world clinical distributions. The manuscript describes four-physician auditing and decontamination but provides no quantitative comparison (e.g., Kolmogorov-Smirnov tests or comorbidity-frequency tables) of the generated data against real clinical query logs or EHR statistics. Without such validation, the reported +6.6-point benchmark improvement and 58.6% preference rate could partly reflect distributional overlap with the evaluation vignettes rather than improved generalization.
- [Evaluation Protocol] Evaluation Protocol section: The LLM-as-a-judge protocol is calibrated on 204 human raters and applied to expert-written clinical vignettes, yet the training corpus contains similar vignette-style and guideline-derived synthetic data. The manuscript does not report a hold-out test on prospective clinical outcomes or external real-world logs, leaving open the possibility that the 58% vs 55.9% HealthBench result and overall preference scores are inflated by shared generative processes. A concrete external validation set would be required to support the claim that the gains reflect true capability rather than evaluation calibration.
minor comments (2)
- [Abstract] Abstract: The phrase 'system-wide decontamination' is used without a brief parenthetical description of the exact procedure or exclusion criteria; adding one sentence would improve immediate clarity for readers.
- [Methods] Notation: The manuscript refers to 'FO base models' and 'MeditronFO variants' without an explicit glossary or table defining the five base models and their corresponding fine-tuned names; a small nomenclature table would reduce ambiguity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, providing honest responses based on the scope and constraints of our work. Where feasible, we have revised the manuscript to incorporate clarifications and additional discussion.
read point-by-point responses
-
Referee: [Data Construction] The manuscript describes four-physician auditing and decontamination but provides no quantitative comparison (e.g., Kolmogorov-Smirnov tests or comorbidity-frequency tables) of the generated data against real clinical query logs or EHR statistics. Without such validation, the reported +6.6-point benchmark improvement and 58.6% preference rate could partly reflect distributional overlap with the evaluation vignettes rather than improved generalization.
Authors: We agree that quantitative distributional comparisons to real-world clinical logs would provide stronger evidence of representativeness. However, such logs and EHR statistics are not publicly available due to privacy regulations, preventing direct statistical tests like Kolmogorov-Smirnov or comorbidity tables from external sources. Our validation instead centers on systematic review by a four-physician panel, as described in the manuscript, combined with system-wide decontamination. We have added a dedicated limitations paragraph in the revised Data Construction section that discusses the representativeness of guideline-derived data, reports coverage statistics from the 46,469 guidelines, and notes the distinction between training distributions and evaluation benchmarks. Performance gains on multiple held-out medical benchmarks support generalization beyond any potential overlap. revision: partial
-
Referee: [Evaluation Protocol] The LLM-as-a-judge protocol is calibrated on 204 human raters and applied to expert-written clinical vignettes, yet the training corpus contains similar vignette-style and guideline-derived synthetic data. The manuscript does not report a hold-out test on prospective clinical outcomes or external real-world logs, leaving open the possibility that the 58% vs 55.9% HealthBench result and overall preference scores are inflated by shared generative processes. A concrete external validation set would be required to support the claim that the gains reflect true capability rather than evaluation calibration.
Authors: We appreciate the concern regarding potential calibration effects. HealthBench is an independent, externally developed benchmark that was not generated by our pipeline or synthetic processes. The LLM-as-a-judge protocol was explicitly calibrated against 204 human raters to align with expert clinical judgment, and we report results on both this protocol and standard aggregate medical benchmarks. We have revised the Evaluation Protocol section to more explicitly delineate the scope of our claims, clarify that results reflect benchmark performance rather than live clinical deployment, and state that prospective outcome validation lies outside the current computational study. We maintain that the combination of decontamination, distinct evaluation vignettes, and human calibration supports the reported gains as reflecting improved capability. revision: yes
- Quantitative distributional analysis against real clinical query logs or EHR data, as these are not publicly accessible due to privacy regulations.
- Prospective validation on real-world clinical outcomes, which would require IRB approval, live deployment, and access to patient data beyond the scope of this paper.
Circularity Check
No circularity: empirical gains measured on external benchmarks
full rationale
The paper's central claims rest on measured performance improvements (+6.6 points on aggregate medical benchmarks, 58% vs 55.9% on HealthBench) obtained by applying a data-construction pipeline to five base models and evaluating the resulting models against independent public QA datasets and a human-calibrated LLM-as-a-judge protocol. No derivation, equation, or first-principles step reduces to its own fitted inputs or self-citations; the synthetic extensions are generated from external guidelines, decontaminated, and then tested on separate vignettes and benchmarks whose distributions are not defined by the pipeline itself. The evaluation protocol is calibrated on 204 external human raters rather than on the training data, rendering the reported preference rates falsifiable outside the construction process.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Clinician audits and vetting of synthetic data from guidelines produce high-fidelity, unbiased training examples representative of clinical practice.
- domain assumption LLM-as-a-judge scores calibrated on 204 human raters provide a reliable proxy for clinical quality on expert-written vignettes.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The corpus unifies eight public medical QA datasets... expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes.
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.