PARHAF, a human-authored corpus of clinical reports for fictitious patients in French

📅 2026-03-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the stringent constraints that data sensitivity and privacy regulations impose on clinical natural language processing in the European Union by introducing a structured protocol for building synthetic clinical corpora. Guided by epidemiological distributions from France’s National Health Data System (SNDS), predefined case scenarios, and specialty-specific templates, medical residents authored and peer-reviewed clinically realistic French-language reports about entirely fictitious patients, which makes the corpus anonymous by design. The resulting open-source corpus comprises 7,394 reports representing 5,009 fictional patients across 18 medical specialties. It includes a general-purpose component and specialized subsets for information extraction in oncology, infectious diseases, and diagnostic coding, and it establishes a replicable methodology for building such corpora in other languages and health systems.

📝 Abstract
The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
Problem

Research questions and friction points this paper is trying to address.

clinical NLP
data privacy
medical records
French clinical corpus
data sharing
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic clinical corpus
privacy-preserving NLP
expert-authored medical reports
French clinical language processing
shareable medical data
Xavier Tannier
Sorbonne Université, Limics
Natural Language Processing, Information Extraction, BioNLP
Salam Abbara
Université Paris-Saclay, UVSQ, Assistance Publique-Hôpitaux de Paris, Raymond Poincaré University Hospital, Infectious Disease Department, Garches, France; Yonsei University College of Medicine, Gangnam Severance Hospital, Department of Laboratory Medicine, Seoul, South Korea
Rémi Flicoteaux
Assistance Publique-Hôpitaux de Paris, Department of medical information, Paris, France
Youness Khalil
Health Data Hub, 75015, Paris, France
Aurélie Névéol
Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
Pierre Zweigenbaum
Senior Researcher, LISN, CNRS, Université Paris-Saclay, Orsay, France (formerly LIMSI)
Natural Language Processing, Biomedical Informatics, BioNLP, Computational Linguistics, Artificial
Emmanuel Bacry
CNRS Ecole Polytechnique
Self-similarity, Multifractal, Stochastic modeling, Statistical finance, Financial time-series modelization