Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus
Pith reviewed 2026-05-10 18:24 UTC · model grok-4.3
The pith
Domain-adaptive pre-training on a new French biomedical corpus often fails to improve specialized performance and degrades general capabilities, though model merging afterward can restore balance and sometimes enhance domain results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Continued pre-training on the collected French biomedical corpus does not reliably outperform the original base models on domain-specific benchmarks and typically reduces accuracy on general tasks, but merging the resulting model with the base model mitigates these trade-offs and in some cases raises performance on the very tasks the adaptation targeted.
What carries the argument
Domain-adaptive pre-training (DAPT) through causal language modeling on the new French health corpus, followed by post-training model merging to combine adapted and base weights.
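The merging step can be illustrated with a minimal sketch. It assumes the simplest possible scheme, a parameter-wise linear interpolation between base and adapted weights; the merging method the paper actually uses is not stated on this page, and `merge_linear`, `alpha`, and the dictionary-of-lists weight format are hypothetical choices made only for illustration.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def merge_linear(base: Weights, adapted: Weights, alpha: float = 0.5) -> Weights:
    """Interpolate parameter-wise: (1 - alpha) * base + alpha * adapted.

    alpha = 0 returns the base weights, alpha = 1 returns the adapted weights.
    This is an assumed merging scheme, not necessarily the paper's.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        adapted_values = adapted[name]
        if len(base_values) != len(adapted_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Blend each scalar weight of the two checkpoints.
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base_values, adapted_values)
        ]
    return merged
```

Under this assumption, sweeping `alpha` between 0 and 1 would trace the trade-off between general and domain-specific behaviour that the review describes.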
If this is right
- DAPT remains viable only in smaller-scale, resource-constrained training runs when conditions are favorable.
- Model merging after DAPT is required to avoid unacceptable drops in general capabilities.
- Merging can improve results on the specialized biomedical tasks themselves in certain settings.
- The open French corpus supports further commercial and research applications in non-English biomedical NLP.
Where Pith is reading between the lines
- For non-English biomedical work, targeted fine-tuning or instruction tuning on smaller datasets may often be preferable to full DAPT.
- The same merging strategy could be tested on other specialized domains such as legal or technical French text.
- Resource-limited teams gain a practical path to domain models without sacrificing broad usability.
- Future evaluations should include direct measures of clinical decision support rather than proxy benchmarks.
Load-bearing premise
The chosen biomedical evaluation tasks and metrics reflect actual real-world clinical performance and the new corpus is representative enough for the results to hold beyond the tested models.
What would settle it
A broader clinical benchmark where DAPT models without merging consistently outperform both base models and merged versions across multiple real-world French medical tasks.
Original abstract
Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet their adaptation to specialized fields remains challenging, particularly for non-English languages. This study investigates domain-adaptive pre-training (DAPT) as a strategy for specializing small to mid-sized LLMs in the French biomedical domain through continued pre-training. We address two key research questions: the viability of specialized continued pre-training for domain adaptation and the relationship between domain-specific performance gains and general capability degradation. Our contributions include the release of a fully open-licensed French biomedical corpus suitable for commercial and open-source applications, the training and release of specialized French biomedical LLMs, and novel insights for DAPT implementation. Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations. Our results cast doubt on the efficacy of DAPT, in contrast to previous works, but we highlight its viability in smaller-scale, resource-constrained scenarios under the right conditions. Findings in this paper further suggest that model merging post-DAPT is essential to mitigate generalization trade-offs, and in some cases even improves performance on specialized tasks at which the DAPT was directed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines domain-adaptive pre-training (DAPT) for small to mid-sized LLMs in the French biomedical domain. It releases a new open-licensed French health corpus, trains and releases specialized models via continued causal language modeling, and conducts comparative evaluations against baselines. The central claims are that DAPT shows limited efficacy relative to prior work (viable mainly in resource-constrained settings) and that post-DAPT model merging is necessary to mitigate generalization trade-offs while sometimes boosting the targeted specialized performance.
Significance. If the empirical results hold under scrutiny, the work would be significant for challenging the assumed benefits of DAPT in non-English specialized domains, providing a new community resource (the corpus and models), and offering practical guidance on merging as a mitigation strategy. The open release of the corpus and models supports reproducibility and further research.
major comments (2)
- [Abstract and Results] Abstract and Results section: The headline claim that the results 'cast doubt on the efficacy of DAPT, in contrast to previous works' is load-bearing for the paper's contribution, yet the manuscript provides no quantitative deltas, error bars, or direct side-by-side comparison of evaluation protocols against the cited prior DAPT studies; without this, the contrast cannot be verified as general rather than setup-specific.
- [Evaluation and Corpus] Evaluation and Corpus sections: The conclusion that DAPT remains viable only under 'the right conditions' in smaller-scale scenarios and that merging is 'essential' rests on the assumption that the chosen French biomedical tasks and the new corpus faithfully measure domain adaptation; the paper does not report task representativeness checks, sub-domain coverage statistics, or sensitivity analyses to alternative task selections, leaving open the possibility that observed trade-offs are artifacts of the evaluation regime.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief table summarizing the key quantitative findings (e.g., performance deltas on specialized vs. general tasks) to allow readers to assess the claims immediately.
- [Methodology] Notation for model variants (e.g., base vs. DAPT vs. merged) should be defined consistently in a single table or section to avoid ambiguity when comparing results across experiments.
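One way the requested deltas and error bars could be produced is sketched below. It assumes per-question 0/1 correctness vectors are available for the base model and for a variant (DAPT or merged) on the same questions; the function names and the choice of a paired bootstrap are illustrative assumptions, not the paper's evaluation protocol.

```python
import random
from typing import List, Sequence, Tuple


def accuracy(correct: Sequence[int]) -> float:
    """Fraction of questions answered correctly (entries are 0 or 1)."""
    return sum(correct) / len(correct)


def paired_bootstrap_delta(
    base_correct: Sequence[int],
    variant_correct: Sequence[int],
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (delta, low, high) in percentage points, variant minus base.

    low/high bound an approximate 95% interval obtained by resampling
    the same question indices for both models (paired bootstrap).
    """
    n = len(base_correct)
    if n == 0 or n != len(variant_correct):
        raise ValueError("need two non-empty vectors of equal length")
    if n_resamples < 1:
        raise ValueError("n_resamples must be at least 1")

    rng = random.Random(seed)
    point = 100.0 * (accuracy(variant_correct) - accuracy(base_correct))

    deltas: List[float] = []
    for _ in range(n_resamples):
        # Draw question indices with replacement; use them for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(variant_correct[i] - base_correct[i] for i in idx) / n
        deltas.append(100.0 * diff)

    deltas.sort()
    low = deltas[int(0.025 * (n_resamples - 1))]
    high = deltas[int(0.975 * (n_resamples - 1))]
    return point, low, high
```

Running this once per task group and per model variant would yield the rows of a summary table; an interval that straddles zero would indicate that the measured gain or loss is not distinguishable from noise at that sample size.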
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing the strongest honest defense of our work while agreeing to revisions that strengthen the manuscript without misrepresenting our results or setup.
- Referee: [Abstract and Results] Abstract and Results section: The headline claim that the results 'cast doubt on the efficacy of DAPT, in contrast to previous works' is load-bearing for the paper's contribution, yet the manuscript provides no quantitative deltas, error bars, or direct side-by-side comparison of evaluation protocols against the cited prior DAPT studies; without this, the contrast cannot be verified as general rather than setup-specific.
  Authors: We acknowledge the importance of making the contrast with prior DAPT work more explicit and verifiable. The results section reports absolute performance metrics on our French biomedical tasks for both DAPT models and strong baselines, and the discussion references key prior studies (e.g., on English biomedical DAPT). However, we agree that the abstract claim would be better supported by explicit quantitative deltas (e.g., percentage point changes relative to the cited works where task overlap allows) and error bars on figures. We will add a dedicated comparison table and error bars in the revision. Direct protocol alignment is inherently limited by language and task differences (French vs. English, available benchmarks), which we will discuss more explicitly as a caveat rather than claiming a universal contrast. This revision clarifies the contribution without overstating generality.
  Revision: yes
- Referee: [Evaluation and Corpus] Evaluation and Corpus sections: The conclusion that DAPT remains viable only under 'the right conditions' in smaller-scale scenarios and that merging is 'essential' rests on the assumption that the chosen French biomedical tasks and the new corpus faithfully measure domain adaptation; the paper does not report task representativeness checks, sub-domain coverage statistics, or sensitivity analyses to alternative task selections, leaving open the possibility that observed trade-offs are artifacts of the evaluation regime.
  Authors: We agree that explicit validation of the evaluation regime strengthens the claims. The tasks were chosen as the most relevant available French biomedical benchmarks (covering clinical, pharmacological, and medical literature domains), and the corpus was constructed via targeted collection from open French health sources with sub-domain balancing during curation. To address the concern, we will add sub-domain coverage statistics (e.g., token distribution across clinical notes, research articles, and guidelines) and a rationale section for task selection in the revised manuscript. Full sensitivity analyses to entirely new task sets would require additional labeled data not available in the current release, so we will frame this as a limitation and note that the open corpus and models enable community-led checks. The observed trade-offs are consistent across the evaluated tasks, supporting the 'right conditions' and merging conclusions within the reported scope.
  Revision: partial
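The sub-domain coverage statistics promised in the second response could be computed along the following lines. This is a minimal sketch under stated assumptions: each document carries a sub-domain label, tokens are approximated by whitespace splitting (a real analysis would presumably use the model's tokenizer), and the function name and label strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def token_share_by_subdomain(docs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return each sub-domain's share of the total token count.

    docs yields (subdomain_label, text) pairs. Tokens are counted by
    whitespace splitting, which is only a rough stand-in for a tokenizer.
    """
    counts: Counter = Counter()
    for subdomain, text in docs:
        counts[subdomain] += len(text.split())

    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalise raw counts so the shares sum to 1.
    return {name: count / total for name, count in counts.items()}
```

A heavily skewed distribution would suggest that benchmark results mostly reflect the dominant sub-domain, which is the kind of artifact the referee's comment is concerned with.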
Circularity Check
No circularity: empirical corpus release and evaluations are self-contained
The paper is an empirical study that collects and releases a new open French biomedical corpus, performs continued pre-training (DAPT) on small-to-mid LLMs, and reports comparative results on domain-specific tasks versus general capabilities. No equations, derivations, or first-principles predictions appear; claims about DAPT viability, generalization trade-offs, and post-DAPT merging rest on experimental outcomes rather than any reduction to fitted inputs or self-referential definitions. Self-citations (if present for prior DAPT literature) are not load-bearing because the central contributions are the new corpus, trained models, and fresh evaluations. The work is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard causal language modeling objectives and downstream task evaluations accurately measure domain adaptation and generalization trade-offs.
Reference graph
Works this paper leans on
- [1] Introduction. LLMs are widely recognized as foundation models that demonstrate promising general capabilities, often exhibiting emergent reasoning abilities with appropriate prompting (Bommasani et al., 2021). However, achieving high performance and clinical reliability in specialized areas requires thoughtful adaptation. Domain-Adaptive Pre-training (DAPT,...
- [2] Related Work. The application of LLMs to medicine has resulted in several high-profile models, predominantly in English, trained via proprietary or open-source DAPT methodologies, often relying on massive datasets of biomedical literature. Google’s Med-PaLM (Singhal et al., 2023), for example, built upon a 540-billion parameter foundation model, achie...
- [3] The PARCOMED Corpus. 3.1. Context. The availability of French biomedical data remains a major challenge for improving the multilingual capabilities of large language models (LLMs) in the medical domain. We introduce and release the PARCOMED corpus, a comprehensive collection of French biomedical texts compiled from a wide range of sources. Although collections o...
- [4] Domain-Adaptive Continual Pre-Training for Medical Applications in French. The experimental methodology discussed in this paper proceeds in three main steps: model selection, DAPT, and merging. Firstly, we run a range of baseline evaluations and selected the best-performing generalist foundation models for DAPT (the evaluation protocol is presented in Section 5). ...
- [5] Evaluation Protocol. The evaluation methodology presented in this paper relies on a set of standardized LLM evaluation benchmarks in both English and French. The specific aims of this evaluation framework are firstly to evaluate whether or not specializing LLMs from the general domain improves their performance on biomedical tasks, and secondly to comp...
- [6] for few-shot language model assessment, which ensures full reproducibility through open and publicly available datasets. In order to measure the trade-off between specialization and generalization brought about by the DAPT strategy outlined in Section 4, we define four task groups: one in the target domain (medicine) and the target language (French), one ...
- [7] Results and Analysis. Figure 1 displays the progression of the weighted average score over the PDAPT training process for each of the four members of the Qwen3 family considered. As the MMLU-Pro-X datasets that make up the “OTHER” task groupings have more difficult questions in larger quantities (they were specifically designed to be more challenging tha...
- [8] Conclusion. This work introduces the PARCOMED corpus, the first French biomedical corpus collection with full licensing compatibility for all downstream applications, addressing a gap in openly-available domain-specific resources. Accompanying this corpus, we release the Qwen3-PDAPT collection, a series of decoder-only language models based on the Qwen...
- [9] Digital Commons for Generative Artificial Intelligence
  Acknowledgements. This work was carried out as part of the PARTAGES project, winner of the Bpifrance France 2030 call for proposals “Digital Commons for Generative Artificial Intelligence”. It was also partially supported by the French National Research Agency (ANR) through the MIAI “AI & Language” chair (ANR-23-IACL-0006). This work was performed using HPC reso...
- [10] References. Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, and Vinko Sabolčec. 2025. Apertus: Democratizing open a...
- [11] MEDITRON-70B: Scaling Medical Pretraining for Large Language Models
  Biomedlm: A 2.7b parameter language model trained on biomedical text. Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Da...
- [12] Medalpaca–an open-source collection of medical conversational ai models and training data
  Classification de cas cliniques et évaluation automatique de réponses d’étudiants: présentation de la campagne deft 2021 (clinical cases classification and automatic evaluation of student answers: Presentation of the deft 2021 challenge). In Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Atelier DÉfi Fouille de Textes (D...
- [13] A french medical conversations corpus annotated for a virtual patient dialogue system. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 574–580. Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoit Crabbe, Laurent Besacier, and Didier Schwab. 2020. Flaubert: ...