PARHAF, a human-authored corpus of clinical reports for fictitious patients in French

📅 2026-03-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the constraints that data sensitivity and privacy regulations place on clinical natural language processing in France and the wider European Union by introducing PARHAF, an open-source corpus of French clinical reports written by humans rather than generated automatically. Following a structured protocol that combined epidemiological guidance from France's National Health Data System (SNDS) with predefined clinical scenarios and document templates, 104 medical residents across 18 specialties authored and peer-reviewed clinically realistic reports about entirely fictitious patients, so the corpus is anonymous by design rather than anonymized after the fact. It comprises 7,394 reports covering 5,009 patient cases. A general-purpose component approximates real-world hospitalization distributions, and four specialized subsets support information extraction in oncology, infectious diseases, and diagnostic coding. The documents are released under a CC-BY license, with a portion temporarily embargoed for future benchmarking, and the protocol offers a replicable methodology for building shareable clinical corpora in other languages and health systems.

📝 Abstract

The development of clinical natural language processing (NLP) systems is severely hampered by the sensitive nature of medical records, which restricts data sharing under stringent privacy regulations, particularly in France and the broader European Union. To address this gap, we introduce PARHAF, a large open-source corpus of clinical documents in French. PARHAF comprises expert-authored clinical reports describing realistic yet entirely fictitious patient cases, making it anonymous and freely shareable by design. The corpus was developed using a structured protocol that combined clinician expertise with epidemiological guidance from the French National Health Data System (SNDS), ensuring broad clinical coverage. A total of 104 medical residents across 18 specialties authored and peer-reviewed the reports following predefined clinical scenarios and document templates. The corpus contains 7394 clinical reports covering 5009 patient cases across a wide range of medical and surgical specialties. It includes a general-purpose component designed to approximate real-world hospitalization distributions, and four specialized subsets that support information-extraction use cases in oncology, infectious diseases, and diagnostic coding. Documents are released under a CC-BY open license, with a portion temporarily embargoed to enable future benchmarking under controlled conditions. PARHAF provides a valuable resource for training and evaluating French clinical language models in a fully privacy-preserving setting, and establishes a replicable methodology for building shareable synthetic clinical corpora in other languages and health systems.
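The abstract says the general-purpose component was designed to approximate real-world hospitalization distributions, but it does not describe how scenarios were apportioned. As a minimal sketch of one way such proportional planning could work, the snippet below splits a fixed number of scenarios across categories with the largest-remainder method. The function name `allocate_quotas`, the category names, and the counts are hypothetical and invented for illustration; they are not SNDS figures and not the authors' procedure.

```python
from math import floor


def allocate_quotas(weights: dict[str, float], total: int) -> dict[str, int]:
    """Split `total` items across categories in proportion to `weights`.

    Uses the largest-remainder method so the quotas always sum to `total`.
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")

    # Ideal (fractional) share for each category.
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    # Start from the integer part of each share.
    quotas = {name: floor(share) for name, share in exact.items()}

    # Give the remaining slots to the largest fractional remainders.
    leftover = total - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - quotas[n], reverse=True)
    for name in by_remainder[:leftover]:
        quotas[name] += 1
    return quotas


# Hypothetical stay counts per specialty (illustrative only, not real data).
example_counts = {"cardiology": 50, "oncology": 30, "infectious_diseases": 20}
quotas = allocate_quotas(example_counts, 10)
print(quotas)
```

Rounding each share independently can make the quotas overshoot or undershoot the target, which is why the sketch floors first and then distributes the leftover slots.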

Problem

Research questions and friction points this paper is trying to address.

clinical NLP

data privacy

medical records

French clinical corpus

data sharing

Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic clinical corpus

privacy-preserving NLP

expert-authored medical reports