LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs
Pith reviewed 2026-05-10 17:25 UTC · model grok-4.3
The pith
Mid-size language models evaluate synthetic French medical-student interviews with accuracy comparable to GPT-4o.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models (≤32B parameters) achieve accuracies comparable to GPT-4o (~90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
What carries the argument
A controlled pipeline that generates synthetic transcripts by mixing ideal and perturbed performances under scenario-specific criteria, then applies adjustable LLM silver-labeling to produce training and test data.
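The review does not reproduce the paper's prompts or perturbation rules. As a rough sketch of the described mechanism, the Python fragment below shows how scenario-specific criteria could drive both the generation prompt and the intended labels; all names (Criterion, build_generation_prompt, perturb_rate) are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Criterion:
    """One scenario-specific evaluation criterion."""
    name: str
    ideal: str      # student behaviour that satisfies the criterion
    perturbed: str  # student behaviour that violates it

def build_generation_prompt(scenario: str, criteria: list[Criterion],
                            perturb_rate: float, rng: random.Random):
    """Compose one generation prompt mixing ideal and perturbed behaviours,
    recording which criteria the dialogue is instructed to satisfy."""
    instructions, intended = [], {}
    for c in criteria:
        met = rng.random() >= perturb_rate  # perturb roughly this share of criteria
        instructions.append(c.ideal if met else c.perturbed)
        intended[c.name] = met
    prompt = (f"Scenario: {scenario}\n"
              "Write a French doctor-patient OSCE interview transcript "
              "in which the student:\n- " + "\n- ".join(instructions))
    return prompt, intended

criteria = [
    Criterion("identity_check", "verifies the patient's identity",
              "never verifies the patient's identity"),
    Criterion("allergy_history", "asks about drug allergies",
              "omits any question about allergies"),
]
prompt, intended = build_generation_prompt("chest pain consultation", criteria,
                                           perturb_rate=0.5,
                                           rng=random.Random(0))
# `prompt` goes to the generator LLM; `intended` records the known skill
# variations that the paper's LLM-assisted pass then turns into silver labels.
```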
Load-bearing premise
The synthetic dialogues and the labels assigned by LLMs accurately reflect what would occur in real OSCE sessions judged by human experts.
What would settle it
Collect human expert ratings on both real student OSCE transcripts and the synthetic versions, then check whether model accuracy on the synthetic set still holds or drops sharply when compared against the human ratings.
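A minimal sketch of that check, assuming binary per-criterion decisions and scikit-learn; the toy values stand in for real expert annotations.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

def validation_report(model_preds, silver_labels, human_labels):
    """Contrast a model's agreement with pipeline silver labels against its
    agreement with human examiners on the same per-criterion decisions.
    A sharp drop from acc_vs_silver to acc_vs_human would mean the silver
    labels do not track real examiner judgment."""
    return {
        "acc_vs_silver": accuracy_score(silver_labels, model_preds),
        "acc_vs_human": accuracy_score(human_labels, model_preds),
        "kappa_vs_human": cohen_kappa_score(human_labels, model_preds),
        "silver_vs_human": accuracy_score(human_labels, silver_labels),
    }

# Toy placeholder values; real inputs would come from expert annotation.
model_preds   = [1, 1, 0, 1, 0, 1, 1, 0]
silver_labels = [1, 1, 0, 1, 0, 1, 0, 0]
human_labels  = [1, 0, 0, 1, 0, 1, 0, 1]
print(validation_report(model_preds, silver_labels, human_labels))
```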
Original abstract
Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models (≤32B parameters) achieve accuracies comparable to GPT-4o (~90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a controlled LLM pipeline to generate synthetic French OSCE doctor-patient interview transcripts by combining ideal and perturbed student performances according to scenario-specific evaluation criteria, then applies an LLM-assisted silver-labeling framework with adjustable strictness to produce evaluation labels. Benchmarking of open-source and proprietary LLMs on this synthetic data shows that mid-size models (≤32B parameters) reach accuracies of ~90%, comparable to GPT-4o; the authors argue that this enables feasible, privacy-preserving, locally deployable evaluation systems for medical education in low-resource settings.
Significance. The work provides a practical method for addressing the scarcity of annotated French OSCE data through synthetic generation and offers evidence that smaller models can match larger ones on this task, which supports potential deployment in privacy-sensitive clinical training environments. The adjustable-strictness labeling and controlled perturbation approach are constructive contributions that could aid reproducible research if the synthetic data can be shown to align with real clinical standards.
major comments (2)
- [Section 5] Benchmarking and Results: The headline accuracies (~90% for ≤32B models matching GPT-4o) are computed exclusively against LLM-generated silver labels produced by the same pipeline that created the dialogues; without any human-expert re-labeling or comparison on a held-out subset of real or synthetic transcripts, the metrics do not establish that the models align with actual OSCE examiner judgments.
- [Methods] Synthetic data generation and labeling pipeline: The central claim that the approach alleviates the need for human examiners during training rests on the assumption that LLM silver labels faithfully reflect clinical skills; this assumption is load-bearing yet untested, as no validation against real annotated French OSCE transcripts is reported, leaving open the possibility that high inter-model agreement arises from shared stylistic or training-data biases rather than genuine clinical fidelity (a sketch of such an agreement check follows this list).
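Neither agreement statistic is reported in the paper; the following is a minimal sketch of the shared-bias diagnostic the second comment calls for, with hypothetical model names and labels.

```python
from itertools import combinations

def pairwise_agreement(labels: dict[str, list[int]]) -> dict[tuple, float]:
    """Raw agreement rate between every pair of annotators (models or humans)
    over the same criteria. High model-model agreement paired with low
    model-human agreement is the signature of shared bias rather than
    genuine clinical fidelity."""
    return {
        (a, b): sum(x == y for x, y in zip(labels[a], labels[b])) / len(labels[a])
        for a, b in combinations(sorted(labels), 2)
    }

# Hypothetical per-criterion labels over six criteria.
labels = {
    "gpt4o":   [1, 1, 0, 1, 0, 1],
    "qwen32b": [1, 1, 0, 1, 0, 1],  # agrees with GPT-4o on every criterion...
    "human":   [1, 0, 0, 1, 1, 1],  # ...while both diverge from the expert
}
print(pairwise_agreement(labels))
```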
minor comments (2)
- [Abstract, Section 4] The description of 'adjustable evaluation strictness' would benefit from an explicit example or parameter table showing how strictness levels affect label distributions and downstream accuracy scores (a minimal illustration follows this list).
- [Conclusion] The paper should clarify whether the generation prompts, perturbation rules, and evaluation rubrics are released as supplementary material to support reproducibility claims.
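The abstract describes adjustable evaluation strictness but gives no wording. One plausible realization, with entirely hypothetical prompt text, is strictness as alternative task descriptions:

```python
STRICTNESS_PROMPTS = {
    # Hypothetical wordings; the paper's actual task descriptions are not
    # reproduced in this review.
    "strict": ("Mark the criterion as met ONLY if the transcript contains "
               "explicit, unambiguous evidence for it."),
    "lenient": ("Mark the criterion as met if the transcript contains "
                "explicit or reasonably implied evidence for it."),
}

def build_eval_prompt(transcript: str, criterion: str,
                      strictness: str = "strict") -> str:
    """One evaluation call: full transcript + a single criterion + a task
    description whose wording encodes the strictness level."""
    return (f"{STRICTNESS_PROMPTS[strictness]}\n\n"
            f"Criterion: {criterion}\n"
            f"Transcript:\n{transcript}\n\n"
            "Answer with exactly one word: MET or NOT_MET.")
```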
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The concerns about reliance on synthetic silver labels without human validation are valid and central to the work's limitations in a low-resource setting. We address each major comment point-by-point below, clarifying our design choices while acknowledging where direct evidence is absent due to data scarcity. We propose targeted revisions to strengthen the discussion of these issues.
Point-by-point responses
- Referee: [Section 5] Benchmarking and Results: The headline accuracies (~90% for ≤32B models matching GPT-4o) are computed exclusively against LLM-generated silver labels produced by the same pipeline that created the dialogues; without any human-expert re-labeling or comparison on a held-out subset of real or synthetic transcripts, the metrics do not establish that the models align with actual OSCE examiner judgments.
  Authors: We agree that the reported accuracies are measured against silver labels from the generation pipeline rather than human OSCE examiners, and that this does not directly prove alignment with real clinical judgments. Our pipeline constructs dialogues with explicitly defined perturbations drawn from scenario-specific evaluation criteria (e.g., missing history questions, inadequate empathy), so the silver labels encode known skill variations rather than being purely emergent. This controlled setup allows benchmarking of model consistency on the task, which is a necessary first step when real annotated French OSCE data does not exist at scale. We will revise Section 5 to explicitly state that these metrics demonstrate inter-model agreement on synthetic data and add a paragraph on the need for future human validation studies on both synthetic and any available real transcripts. (revision: partial)
- Referee: [Methods] Synthetic data generation and labeling pipeline: The central claim that the approach alleviates the need for human examiners during training rests on the assumption that LLM silver labels faithfully reflect clinical skills; this assumption is load-bearing yet untested, as no validation against real annotated French OSCE transcripts is reported, leaving open the possibility that high inter-model agreement arises from shared stylistic or training-data biases rather than genuine clinical fidelity.
  Authors: The assumption that silver labels capture clinical skills is indeed load-bearing and remains untested against real French OSCE transcripts, as none are publicly available or accessible in sufficient quantity for this low-resource language. We mitigate bias risks by grounding both generation and labeling in published medical education rubrics and by using adjustable strictness parameters to simulate varying examiner standards. The comparable performance of mid-size open models to GPT-4o on this data supports the practical claim of enabling local, privacy-preserving systems, even if the labels are synthetic. We will add a new limitations subsection in the Methods and Discussion sections that directly addresses the risk of shared LLM biases and outlines a roadmap for human-expert validation once small real datasets become available. (revision: partial)
- What remains out of reach: direct empirical validation of the LLM silver labels against human expert annotations on real French OSCE transcripts, which is impossible at present because no such annotated corpus exists at usable scale for this low-resource setting.
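Not from the paper: a back-of-envelope estimate, via the standard normal-approximation sample-size formula, of how small such a human-labeled validation set could be once expert annotation becomes feasible.

```python
import math

def n_for_accuracy_ci(p_hat: float = 0.90, margin: float = 0.05,
                      z: float = 1.96) -> int:
    """Criterion-level sample size needed to estimate an accuracy near
    p_hat to within +/-margin at ~95% confidence (normal approximation):
    n = z^2 * p_hat * (1 - p_hat) / margin^2."""
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

print(n_for_accuracy_ci())  # 139: verifying the reported ~90% accuracy to
                            # within 5 points needs ~140 expert-labeled decisions
```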
Circularity Check
No circularity: empirical benchmark on synthetic silver-labeled data
Full rationale
The paper describes a data-generation pipeline that produces synthetic French OSCE dialogues from scenario criteria and then applies an LLM-assisted silver-labeling step before benchmarking model accuracies on those labels. No equations, fitted parameters, or derivations are presented that reduce to their own inputs by construction. No self-citations are invoked to establish uniqueness theorems or to smuggle ansatzes. The reported ~90% accuracies are direct empirical measurements against the silver labels; while this raises separate questions of external validity, it does not constitute a self-definitional loop or a prediction that is statistically forced by the labeling process itself. The work is therefore self-contained as an empirical feasibility study.