AI Summary
This work addresses the stringent constraints imposed by data sensitivity and privacy regulations on clinical natural language processing in the European Union by introducing a novel method for generating synthetic clinical corpora through structured protocols. Leveraging the epidemiological distributions from France's National Health Data System (SNDS), predefined case scenarios, and specialty-specific templates, medical experts collaboratively authored and peer-reviewed fully anonymized yet clinically realistic French-language clinical reports. The resulting open-source corpus comprises 7,394 reports representing 5,009 fictional patients across 18 medical specialties. It supports both general-purpose modeling and specialized information extraction tasks (including oncology, infectious disease, and diagnostic coding) and establishes a reproducible, cross-lingual framework for synthetic clinical corpus construction.
Abstract
The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates.
The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions.
PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.