LLM-Based Data Generation and Clinical Skills Evaluation for Low-Resource French OSCEs

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of Objective Structured Clinical Examinations (OSCEs) in French-speaking medical education, which are constrained by scarce human and computational resources, insufficient practice opportunities, and a lack of annotated real-world French physician–patient dialogues. To overcome these challenges, the authors propose a large language model–based controllable synthesis approach that integrates scenario guidance and performance perturbation to generate French OSCE dialogues aligned with official scoring rubrics. They further develop an automated evaluation framework with adjustable strictness and introduce a silver-label auto-annotation mechanism. Using only open-source models with ≤32B parameters, their method achieves approximately 90% evaluation accuracy, comparable to GPT-4o, demonstrating the feasibility of deploying localized, privacy-preserving systems for medical education assessment.
📝 Abstract
Objective Structured Clinical Examinations (OSCEs) are the standard method for assessing medical students' clinical and communication skills through structured patient interviews. In France, however, the organization of training sessions is limited by human and logistical constraints, restricting students' access to repeated practice and structured feedback. Recent advances in Natural Language Processing (NLP) and Large Language Models (LLMs) now offer the opportunity to automatically evaluate such medical interviews, thereby alleviating the need for human examiners during training. Yet, real French OSCE annotated transcripts remain extremely scarce, limiting reproducible research and reliable benchmarking. To address these challenges, we investigate the use of LLMs for both generating and evaluating French OSCE dialogues in a low-resource context. We introduce a controlled pipeline that produces synthetic doctor-patient interview transcripts guided by scenario-specific evaluation criteria, combining ideal and perturbed performances to simulate varying student skill levels. The resulting dialogues are automatically silver-labeled through an LLM-assisted framework supporting adjustable evaluation strictness. Benchmarking multiple open-source and proprietary LLMs shows that mid-size models ($\le$32B parameters) achieve accuracies comparable to GPT-4o ($\sim$90\%) on synthetic data, highlighting the feasibility of locally deployable, privacy-preserving evaluation systems for medical education.
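The pipeline the abstract describes has three moving parts: generating an ideal dialogue from a scenario and its rubric, perturbing performance to simulate weaker students, and silver-labeling transcripts with an adjustable-strictness judge. A minimal sketch of that control flow is below; all function names, prompt wording, and the threshold-based labeling rule are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of scenario-guided generation with performance
# perturbation and threshold-based silver-labeling. Names and prompt
# structure are assumptions for illustration only.
import random

def build_generation_prompt(scenario, rubric, perturbed_items=()):
    """Compose a prompt asking an LLM to write an OSCE dialogue that
    satisfies the rubric while deliberately failing the perturbed items."""
    lines = [f"Scenario: {scenario}", "Rubric items:"]
    for item in rubric:
        status = "FAIL" if item in perturbed_items else "PASS"
        lines.append(f"- {item} [{status}]")
    lines.append("Write a French doctor-patient dialogue matching these targets.")
    return "\n".join(lines)

def perturb(rubric, skill_level, rng=random):
    """Drop a fraction of rubric items to simulate a weaker student;
    skill_level=1.0 keeps every item (ideal performance)."""
    n_fail = round(len(rubric) * (1.0 - skill_level))
    return tuple(rng.sample(rubric, n_fail))

def silver_label(judge_scores, strictness=0.5):
    """Binarize per-item judge confidences into silver labels; a higher
    strictness threshold demands stronger evidence to credit an item."""
    return {item: score >= strictness for item, score in judge_scores.items()}
```

In this sketch, "adjustable strictness" is modeled as a simple decision threshold on the judge's per-criterion confidence; the paper's actual mechanism may differ.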
Problem

Research questions and friction points this paper is trying to address.

- OSCE
- low-resource
- clinical skills evaluation
- French medical education
- data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

- LLM-based data generation
- clinical skills evaluation
- low-resource OSCE
- synthetic dialogue
- silver-labeling
Tian Huang
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Tom Bourgeade
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Irina Illina
LORIA, Inria
speech recognition, acoustic and language modeling, semantics