Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus

Aidan Mannion; Armand Violle; Cécile Macaire; Didier Schwab; François Portet; Lorraine Goeuriot; Stéphane Ohayon; Xavier Tannier

arxiv: 2604.06903 · v1 · submitted 2026-04-08 · 💻 cs.CL

Pith reviewed 2026-05-10 18:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords: domain-adaptive pre-training, French biomedical corpus, language model specialization, model merging, biomedical NLP, non-English LLMs, continued pre-training, French health data

The pith

Domain-adaptive pre-training on a new French biomedical corpus often fails to improve specialized performance and degrades general capabilities, though model merging afterward can restore balance and sometimes enhance domain results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether continued pre-training on French health texts still justifies the effort for making language models more effective in biomedicine. It releases an open French biomedical corpus and trains several small to mid-sized models on it. Experiments show that this specialization step frequently produces no clear gains on medical tasks while hurting performance on everyday language use. The work finds that merging the adapted model back with its original base version largely fixes the generalization loss and can even lift scores on the targeted biomedical evaluations in smaller training setups.

Core claim

Continued pre-training on the collected French biomedical corpus does not reliably outperform the original base models on domain-specific benchmarks and typically reduces accuracy on general tasks, but merging the resulting model with the base model mitigates these trade-offs and in some cases raises performance on the very tasks the adaptation targeted.

What carries the argument

Domain-adaptive pre-training (DAPT) through causal language modeling on the new French health corpus, followed by post-training model merging to combine adapted and base weights.
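
The merging step can be illustrated with spherical linear interpolation (SLERP), the method named in the Figure 2 caption. The following is a minimal sketch, not the authors' implementation: the function name `slerp`, the flat-list weight representation, the interpolation factor `t = 0.5`, and the fallback to linear interpolation for near-parallel vectors are all assumptions made for illustration. In practice a merge of this kind would be applied tensor by tensor across the model.

```python
import math

def slerp(base, adapted, t=0.5, eps=1e-6):
    """Spherically interpolate between two flat weight vectors.

    t=0 returns the base weights, t=1 the adapted weights.
    Hypothetical helper: names and defaults are illustrative only.
    """
    if len(base) != len(adapted):
        raise ValueError("weight vectors must have the same length")

    def lerp():
        # Plain linear interpolation, used when SLERP is ill-defined.
        return [(1 - t) * b + t * a for b, a in zip(base, adapted)]

    norm_b = math.sqrt(sum(x * x for x in base))
    norm_a = math.sqrt(sum(x * x for x in adapted))
    if norm_b < eps or norm_a < eps:
        return lerp()

    # Angle between the two weight vectors, clamped for numerical safety.
    cos_omega = sum(b * a for b, a in zip(base, adapted)) / (norm_b * norm_a)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Vectors are (anti-)parallel: fall back to linear interpolation.
        return lerp()

    w_base = math.sin((1 - t) * omega) / sin_omega
    w_adapted = math.sin(t * omega) / sin_omega
    return [w_base * b + w_adapted * a for b, a in zip(base, adapted)]
```

Unlike a plain weighted average, SLERP follows the arc between the two weight directions, which is one reason it is a common choice when the adapted model has drifted from its base.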

If this is right

DAPT remains viable only in smaller-scale, resource-constrained training runs when conditions are favorable.
Model merging after DAPT is required to avoid unacceptable drops in general capabilities.
Merging can improve results on the specialized biomedical tasks themselves in certain settings.
The open French corpus supports further commercial and research applications in non-English biomedical NLP.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

For non-English biomedical work, targeted fine-tuning or instruction tuning on smaller datasets may often be preferable to full DAPT.
The same merging strategy could be tested on other specialized domains such as legal or technical French text.
Resource-limited teams gain a practical path to domain models without sacrificing broad usability.
Future evaluations should include direct measures of clinical decision support rather than proxy benchmarks.

Load-bearing premise

The chosen biomedical evaluation tasks and metrics reflect actual real-world clinical performance and the new corpus is representative enough for the results to hold beyond the tested models.

What would settle it

A broader clinical benchmark where DAPT models without merging consistently outperform both base models and merged versions across multiple real-world French medical tasks.

Figures

Figures reproduced from arXiv: 2604.06903 by Aidan Mannion, Armand Violle, Cécile Macaire, Didier Schwab, François Portet, Lorraine Goeuriot, Stéphane Ohayon, Xavier Tannier.

**Figure 1.** Progression of evaluation scores on the four task groups.

**Figure 2.** Comparison of group-level averages for base vs. specialized models, along with the SLERP…

Original abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet their adaptation to specialized fields remains challenging, particularly for non-English languages. This study investigates domain-adaptive pre-training (DAPT) as a strategy for specializing small to mid-sized LLMs in the French biomedical domain through continued pre-training. We address two key research questions: the viability of specialized continued pre-training for domain adaptation and the relationship between domain-specific performance gains and general capability degradation. Our contributions include the release of a fully open-licensed French biomedical corpus suitable for commercial and open-source applications, the training and release of specialized French biomedical LLMs, and novel insights for DAPT implementation. Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations. Our results cast doubt on the efficacy of DAPT, in contrast to previous works, but we highlight its viability in smaller-scale, resource-constrained scenarios under the right conditions. Findings in this paper further suggest that model merging post-DAPT is essential to mitigate generalization trade-offs, and in some cases even improves performance on specialized tasks at which the DAPT was directed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New open French biomedical corpus is the real value here, while the DAPT and merging claims need more visible evidence to convince.

The main things to know are that this paper releases a new fully open French biomedical corpus and reports that DAPT on it does not always pay off without model merging to recover general capabilities.

The corpus release is the clearest positive. Open data for French health texts is hard to come by, and making it available for both commercial and open-source use is a practical step forward. They also put out some trained models, which can serve as starting points for others. The experiments address the questions about DAPT viability and trade-offs with general performance. They describe collecting and refining texts and running continued pre-training, which is standard but done here for this specific language and domain.

The soft spots center on the results. The claim that their findings cast doubt on DAPT efficacy rests on their evaluations, but the abstract gives no numbers or details on baselines and metrics. This makes it hard to see how much the contrast with prior work actually holds up. The stress test points to evaluation tasks and corpus representativeness as the weak link: if the tasks are too narrow or the data skewed toward certain medical subfields, the observed need for merging could be an artifact of the setup rather than a broad rule.

This paper is for people working on domain adaptation of LLMs in biomedical or other specialized non-English settings. A reader who needs French data or is testing merging techniques after adaptation would get the most out of it. I recommend sending it to peer review. The data contribution is solid enough to warrant referees checking the full experimental details.

Referee Report

2 major / 2 minor

Summary. The paper examines domain-adaptive pre-training (DAPT) for small to mid-sized LLMs in the French biomedical domain. It releases a new open-licensed French health corpus, trains and releases specialized models via continued causal language modeling, and conducts comparative evaluations against baselines. The central claims are that DAPT shows limited efficacy relative to prior work (viable mainly in resource-constrained settings) and that post-DAPT model merging is necessary to mitigate generalization trade-offs while sometimes boosting the targeted specialized performance.

Significance. If the empirical results hold under scrutiny, the work would be significant for challenging the assumed benefits of DAPT in non-English specialized domains, providing a new community resource (the corpus and models), and offering practical guidance on merging as a mitigation strategy. The open release of the corpus and models supports reproducibility and further research.

major comments (2)

[Abstract and Results] The headline claim that the results 'cast doubt on the efficacy of DAPT, in contrast to previous works' is load-bearing for the paper's contribution, yet the manuscript provides no quantitative deltas, error bars, or direct side-by-side comparison of evaluation protocols against the cited prior DAPT studies; without this, the contrast cannot be verified as general rather than setup-specific.
[Evaluation and Corpus] The conclusion that DAPT remains viable only under 'the right conditions' in smaller-scale scenarios and that merging is 'essential' rests on the assumption that the chosen French biomedical tasks and the new corpus faithfully measure domain adaptation; the paper does not report task representativeness checks, sub-domain coverage statistics, or sensitivity analyses to alternative task selections, leaving open the possibility that observed trade-offs are artifacts of the evaluation regime.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief table summarizing the key quantitative findings (e.g., performance deltas on specialized vs. general tasks) to allow readers to assess the claims immediately.
[Methodology] Notation for model variants (e.g., base vs. DAPT vs. merged) should be defined consistently in a single table or section to avoid ambiguity when comparing results across experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below, providing the strongest honest defense of our work while agreeing to revisions that strengthen the manuscript without misrepresenting our results or setup.

Referee: [Abstract and Results] The headline claim that the results 'cast doubt on the efficacy of DAPT, in contrast to previous works' is load-bearing for the paper's contribution, yet the manuscript provides no quantitative deltas, error bars, or direct side-by-side comparison of evaluation protocols against the cited prior DAPT studies; without this, the contrast cannot be verified as general rather than setup-specific.

Authors: We acknowledge the importance of making the contrast with prior DAPT work more explicit and verifiable. The results section reports absolute performance metrics on our French biomedical tasks for both DAPT models and strong baselines, and the discussion references key prior studies (e.g., on English biomedical DAPT). However, we agree that the abstract claim would be better supported by explicit quantitative deltas (e.g., percentage point changes relative to the cited works where task overlap allows) and error bars on figures. We will add a dedicated comparison table and error bars in the revision. Direct protocol alignment is inherently limited by language and task differences (French vs. English, available benchmarks), which we will discuss more explicitly as a caveat rather than claiming a universal contrast. This revision clarifies the contribution without overstating generality.

Revision: yes

Referee: [Evaluation and Corpus] The conclusion that DAPT remains viable only under 'the right conditions' in smaller-scale scenarios and that merging is 'essential' rests on the assumption that the chosen French biomedical tasks and the new corpus faithfully measure domain adaptation; the paper does not report task representativeness checks, sub-domain coverage statistics, or sensitivity analyses to alternative task selections, leaving open the possibility that observed trade-offs are artifacts of the evaluation regime.

Authors: We agree that explicit validation of the evaluation regime strengthens the claims. The tasks were chosen as the most relevant available French biomedical benchmarks (covering clinical, pharmacological, and medical literature domains), and the corpus was constructed via targeted collection from open French health sources with sub-domain balancing during curation. To address the concern, we will add sub-domain coverage statistics (e.g., token distribution across clinical notes, research articles, and guidelines) and a rationale section for task selection in the revised manuscript. Full sensitivity analyses to entirely new task sets would require additional labeled data not available in the current release, so we will frame this as a limitation and note that the open corpus and models enable community-led checks. The observed trade-offs are consistent across the evaluated tasks, supporting the 'right conditions' and merging conclusions within the reported scope.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical corpus release and evaluations are self-contained

The paper is an empirical study that collects and releases a new open French biomedical corpus, performs continued pre-training (DAPT) on small-to-mid LLMs, and reports comparative results on domain-specific tasks versus general capabilities. No equations, derivations, or first-principles predictions appear; claims about DAPT viability, generalization trade-offs, and post-DAPT merging rest on experimental outcomes rather than any reduction to fitted inputs or self-referential definitions. Self-citations (if present for prior DAPT literature) are not load-bearing because the central contributions are the new corpus, trained models, and fresh evaluations. The work is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical training and evaluation of LLMs using a newly collected corpus; no free parameters are explicitly fitted to produce the headline result, and no new entities are postulated.

axioms (1)

Domain assumption: Standard causal language modeling objectives and downstream task evaluations accurately measure domain adaptation and generalization trade-offs.
Invoked implicitly when interpreting DAPT results versus general capabilities.
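
The causal language modeling objective referred to here is not written out on this page. In its standard textbook form (an assumption on our part, since the paper's exact loss, masking, and normalization are not shown here), continued pre-training minimizes the average negative log-likelihood of each token given its left context, where $\theta$ denotes the model parameters, $x_1, \dots, x_T$ the tokens of a training sequence, and $T$ its length:

```latex
\mathcal{L}_{\mathrm{CLM}}(\theta)
  = -\frac{1}{T} \sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_{<t}\right)
```

The axiom amounts to assuming that lowering this loss on domain text, together with the chosen downstream benchmarks, is a faithful proxy for domain adaptation.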

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1]

However, achieving high performance and clinical reliability in specialized areas requires thoughtful adaptation

Introduction LLMs are widely recognized as foundation models that demonstrate promising general capabilities, often exhibiting emergent reasoning abilities with appropriate prompting (Bommasani et al., 2021). However, achieving high performance and clinical reliability in specialized areas requires thoughtful adaptation. Domain-Adaptive Pre-training (DAPT,...
[2]

Related Work The application of LLMs to medicine has resulted in several high-profile models, predominantly in English, trained via proprietary or open-source DAPT methodologies, often relying on massive datasets of biomedical literature. Google’s Med-PaLM (Singhal et al., 2023), for example, built upon a 540-billion parameter foundation model, achie...
[3]

Context The availability of French biomedical data remains a major challenge for improving the multilingual capabilities of large language models (LLMs) in the medical domain

The PARCOMED Corpus 3.1. Context The availability of French biomedical data remains a major challenge for improving the multilingual capabilities of large language models (LLMs) in the medical domain. We introduce and release the PARCOMED corpus, a comprehensive collection of French biomedical texts compiled from a wide range of sources. Although collections o...
[4]

Best-performing

Domain-Adaptive Continual Pre-Training for Medical Applications in French The experimental methodology discussed in this paper proceeds in three main steps: model selection, DAPT, and merging. Firstly, we run a range of baseline evaluations and selected the best-performing generalist foundation models for DAPT (the evaluation protocol is presented in Section 5). ...
[5]

lm-evaluation-harness

Evaluation Protocol The evaluation methodology presented in this paper relies on a set of standardized LLM evaluation benchmarks in both English and French. The specific aims of this evaluation framework are firstly to evaluate whether or not specializing LLMs from the general domain improves their performance on biomedical tasks, and secondly to comp...
[6]

Accuracy

for few-shot language model assessment, which ensures full reproducibility through open and publicly available datasets. In order to measure the trade-off between specialization and generalization brought about by the DAPT strategy outlined in Section 4, we define four task groups: one in the target domain (medicine) and the target language (French), one ...

[7]

Results and Analysis Figure 1 displays the progression of the weighted average score over the PDAPT training process for each of the four members of the Qwen3 family considered. As the MMLU-Pro-X datasets that make up the “OTHER” task groupings have more difficult questions in larger quantities (they were specifically designed to be more challenging tha...
[8]

Conclusion This work introduces the PARCOMED corpus, the first French biomedical corpus collection with full licensing compatibility for all downstream applications, addressing a gap in openly-available domain-specific resources. Accompanying this corpus, we release the Qwen3-PDAPT collection, a series of decoder-only language models based on the Qwen...
[9]

Digital Commons for Generative Artificial Intelligence

Acknowledgements This work was carried out as part of the PARTAGES project, winner of the Bpifrance France 2030 call for proposals “Digital Commons for Generative Artificial Intelligence”. It was also partially supported by the French National Research Agency (ANR) through the MIAI “AI & Language” chair (ANR-23-IACL-0006). This work was performed using HPC reso...
[10]

References Project Apertus, Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, and Vinko Sabolčec. 2025. Apertus: Democratizing open a...
[11]

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

Biomedlm: A 2.7b parameter language model trained on biomedical text. Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Da...
[12]

Medalpaca–an open-source collection of medical conversational ai models and training data

Classification de cas cliniques et évaluation automatique de réponses d’étudiants: présentation de la campagne deft 2021 (clinical cases classification and automatic evaluation of student answers: Presentation of the deft 2021 challenge). In Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Atelier DÉfi Fouille de Textes (D...
[13]

Bernard L Welch

A french medical conversations corpus annotated for a virtual patient dialogue system. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 574–580. Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoit Crabbe, Laurent Besacier, and Didier Schwab. 2020. Flaubert: ...
