LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs
Pith reviewed 2026-05-10 17:25 UTC · model grok-4.3
The pith
Mid-size language models evaluate synthetic French medical-student interviews with accuracy comparable to GPT-4o.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models (≤32B parameters) achieve accuracies comparable to GPT-4o (~90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
What carries the argument
A controlled pipeline that generates synthetic transcripts by mixing ideal and perturbed performances under scenario-specific criteria, then applies adjustable LLM silver-labeling to produce training and test data.
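The review does not reproduce the paper's prompts or perturbation rules. As a rough sketch of the described mechanism, the Python fragment below shows how scenario-specific criteria could drive both the generation prompt and the intended labels; all names (Criterion, build_generation_prompt, perturb_rate) are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Criterion:
    """One scenario-specific evaluation criterion."""
    name: str
    ideal: str      # student behaviour that satisfies the criterion
    perturbed: str  # student behaviour that violates it

def build_generation_prompt(scenario: str, criteria: list[Criterion],
                            perturb_rate: float, rng: random.Random):
    """Compose one generation prompt mixing ideal and perturbed behaviours,
    recording which criteria the dialogue is instructed to satisfy."""
    instructions, intended = [], {}
    for c in criteria:
        met = rng.random() >= perturb_rate  # perturb roughly this share of criteria
        instructions.append(c.ideal if met else c.perturbed)
        intended[c.name] = met
    prompt = (f"Scenario: {scenario}\n"
              "Write a French doctor-patient OSCE interview transcript "
              "in which the student:\n- " + "\n- ".join(instructions))
    return prompt, intended

criteria = [
    Criterion("identity_check", "verifies the patient's identity",
              "never verifies the patient's identity"),
    Criterion("allergy_history", "asks about drug allergies",
              "omits any question about allergies"),
]
prompt, intended = build_generation_prompt("chest pain consultation", criteria,
                                           perturb_rate=0.5,
                                           rng=random.Random(0))
# `prompt` goes to the generator LLM; `intended` records the known skill
# variations that the paper's LLM-assisted pass then turns into silver labels.
```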
Load-bearing premise
The synthetic dialogues and the labels assigned by LLMs accurately reflect what would occur in real OSCE sessions judged by human experts.
What would settle it
Collect human expert ratings on both real student OSCE transcripts and the synthetic versions, then check whether model accuracy on the synthetic set still holds or drops sharply when compared against the human ratings.
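A minimal sketch of that check, assuming binary per-criterion decisions and scikit-learn; the toy values stand in for real expert annotations.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

def validation_report(model_preds, silver_labels, human_labels):
    """Contrast a model's agreement with pipeline silver labels against its
    agreement with human examiners on the same per-criterion decisions.
    A sharp drop from acc_vs_silver to acc_vs_human would mean the silver
    labels do not track real examiner judgment."""
    return {
        "acc_vs_silver": accuracy_score(silver_labels, model_preds),
        "acc_vs_human": accuracy_score(human_labels, model_preds),
        "kappa_vs_human": cohen_kappa_score(human_labels, model_preds),
        "silver_vs_human": accuracy_score(human_labels, silver_labels),
    }

# Toy placeholder values; real inputs would come from expert annotation.
model_preds   = [1, 1, 0, 1, 0, 1, 1, 0]
silver_labels = [1, 1, 0, 1, 0, 1, 0, 0]
human_labels  = [1, 0, 0, 1, 0, 1, 0, 1]
print(validation_report(model_preds, silver_labels, human_labels))
```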
Original abstract
Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models (≤32B parameters) achieve accuracies comparable to GPT-4o (~90%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a controlled LLM pipeline to generate synthetic French OSCE doctor-patient interview transcripts by combining ideal and perturbed student performances according to scenario-specific evaluation criteria, then applies an LLM-assisted silver-labeling framework with adjustable strictness to produce evaluation labels. Benchmarking of open-source and proprietary LLMs on this synthetic data shows that mid-size models (≤32B parameters) reach accuracies of ~90%, comparable to GPT-4o; the authors argue that this enables feasible, privacy-preserving, locally deployable evaluation systems for medical education in low-resource settings.
Significance. The work provides a practical method for addressing the scarcity of annotated French OSCE data through synthetic generation and offers evidence that smaller models can match larger ones on this task, which supports potential deployment in privacy-sensitive clinical training environments. The adjustable-strictness labeling and controlled perturbation approach are constructive contributions that could aid reproducible research if the synthetic data can be shown to align with real clinical standards.
major comments (2)
- [Section 5] Benchmarking and Results: The headline accuracies (~90% for ≤32B models matching GPT-4o) are computed exclusively against LLM-generated silver labels produced by the same pipeline that created the dialogues; without any human-expert re-labeling or comparison on a held-out subset of real or synthetic transcripts, the metrics do not establish that the models align with actual OSCE examiner judgments.
- [Methods] Synthetic data generation and labeling pipeline: The central claim that the approach alleviates the need for human examiners during training rests on the assumption that LLM silver labels faithfully reflect clinical skills; this assumption is load-bearing yet untested, as no validation against real annotated French OSCE transcripts is reported, leaving open the possibility that high inter-model agreement arises from shared stylistic or training-data biases rather than genuine clinical fidelity (a sketch of such an agreement check follows this list).
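Neither agreement statistic is reported in the paper; the following is a minimal sketch of the shared-bias diagnostic the second comment calls for, with hypothetical model names and labels.

```python
from itertools import combinations

def pairwise_agreement(labels: dict[str, list[int]]) -> dict[tuple, float]:
    """Raw agreement rate between every pair of annotators (models or humans)
    over the same criteria. High model-model agreement paired with low
    model-human agreement is the signature of shared bias rather than
    genuine clinical fidelity."""
    return {
        (a, b): sum(x == y for x, y in zip(labels[a], labels[b])) / len(labels[a])
        for a, b in combinations(sorted(labels), 2)
    }

# Hypothetical per-criterion labels over six criteria.
labels = {
    "gpt4o":   [1, 1, 0, 1, 0, 1],
    "qwen32b": [1, 1, 0, 1, 0, 1],  # agrees with GPT-4o on every criterion...
    "human":   [1, 0, 0, 1, 1, 1],  # ...while both diverge from the expert
}
print(pairwise_agreement(labels))
```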
minor comments (2)
- [Abstract, Section 4] The description of 'adjustable evaluation strictness' would benefit from an explicit example or parameter table showing how strictness levels affect label distributions and downstream accuracy scores (a minimal illustration follows this list).
- [Conclusion] The paper should clarify whether the generation prompts, perturbation rules, and evaluation rubrics are released as supplementary material to support reproducibility claims.
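The abstract describes adjustable evaluation strictness but gives no wording. One plausible realization, with entirely hypothetical prompt text, is strictness as alternative task descriptions:

```python
STRICTNESS_PROMPTS = {
    # Hypothetical wordings; the paper's actual task descriptions are not
    # reproduced in this review.
    "strict": ("Mark the criterion as met ONLY if the transcript contains "
               "explicit, unambiguous evidence for it."),
    "lenient": ("Mark the criterion as met if the transcript contains "
                "explicit or reasonably implied evidence for it."),
}

def build_eval_prompt(transcript: str, criterion: str,
                      strictness: str = "strict") -> str:
    """One evaluation call: full transcript + a single criterion + a task
    description whose wording encodes the strictness level."""
    return (f"{STRICTNESS_PROMPTS[strictness]}\n\n"
            f"Criterion: {criterion}\n"
            f"Transcript:\n{transcript}\n\n"
            "Answer with exactly one word: MET or NOT_MET.")
```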
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The concerns about reliance on synthetic silver labels without human validation are valid and central to the work's limitations in a low-resource setting. We address each major comment point-by-point below, clarifying our design choices while acknowledging where direct evidence is absent due to data scarcity. We propose targeted revisions to strengthen the discussion of these issues.
Point-by-point responses
- Referee: [Section 5] Benchmarking and Results: The headline accuracies (~90% for ≤32B models matching GPT-4o) are computed exclusively against LLM-generated silver labels produced by the same pipeline that created the dialogues; without any human-expert re-labeling or comparison on a held-out subset of real or synthetic transcripts, the metrics do not establish that the models align with actual OSCE examiner judgments.
  Authors: We agree that the reported accuracies are measured against silver labels from the generation pipeline rather than human OSCE examiners, and that this does not directly prove alignment with real clinical judgments. Our pipeline constructs dialogues with explicitly defined perturbations drawn from scenario-specific evaluation criteria (e.g., missing history questions, inadequate empathy), so the silver labels encode known skill variations rather than being purely emergent. This controlled setup allows benchmarking of model consistency on the task, which is a necessary first step when real annotated French OSCE data does not exist at scale. We will revise Section 5 to explicitly state that these metrics demonstrate inter-model agreement on synthetic data and add a paragraph on the need for future human validation studies on both synthetic and any available real transcripts. (revision: partial)
- Referee: [Methods] Synthetic data generation and labeling pipeline: The central claim that the approach alleviates the need for human examiners during training rests on the assumption that LLM silver labels faithfully reflect clinical skills; this assumption is load-bearing yet untested, as no validation against real annotated French OSCE transcripts is reported, leaving open the possibility that high inter-model agreement arises from shared stylistic or training-data biases rather than genuine clinical fidelity.
  Authors: The assumption that silver labels capture clinical skills is indeed load-bearing and remains untested against real French OSCE transcripts, as none are publicly available or accessible in sufficient quantity for this low-resource language. We mitigate bias risks by grounding both generation and labeling in published medical education rubrics and by using adjustable strictness parameters to simulate varying examiner standards. The comparable performance of mid-size open models to GPT-4o on this data supports the practical claim of enabling local, privacy-preserving systems, even if the labels are synthetic. We will add a new limitations subsection in the Methods and Discussion sections that directly addresses the risk of shared LLM biases and outlines a roadmap for human-expert validation once small real datasets become available. (revision: partial)
- What remains out of reach: direct empirical validation of the LLM silver labels against human expert annotations on real French OSCE transcripts, which is impossible at present because no such annotated corpus exists at usable scale for this low-resource setting.
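Not from the paper: a back-of-envelope estimate, via the standard normal-approximation sample-size formula, of how small such a human-labeled validation set could be once expert annotation becomes feasible.

```python
import math

def n_for_accuracy_ci(p_hat: float = 0.90, margin: float = 0.05,
                      z: float = 1.96) -> int:
    """Criterion-level sample size needed to estimate an accuracy near
    p_hat to within +/-margin at ~95% confidence (normal approximation):
    n = z^2 * p_hat * (1 - p_hat) / margin^2."""
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / margin ** 2)

print(n_for_accuracy_ci())  # 139: verifying the reported ~90% accuracy to
                            # within 5 points needs ~140 expert-labeled decisions
```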
Circularity Check
No circularity: empirical benchmark on synthetic silver-labeled data
Full rationale
The paper describes a data-generation pipeline that produces synthetic French OSCE dialogues from scenario criteria and then applies an LLM-assisted silver-labeling step before benchmarking model accuracies on those labels. No equations, fitted parameters, or derivations are presented that reduce to their own inputs by construction. No self-citations are invoked to establish uniqueness theorems or to smuggle ansatzes. The reported ~90% accuracies are direct empirical measurements against the silver labels; while this raises separate questions of external validity, it does not constitute a self-definitional loop or a prediction that is statistically forced by the labeling process itself. The work is therefore self-contained as an empirical feasibility study.