Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

📅 2025-07-25

🤖 AI Summary

A reliable benchmark for adverse drug event (ADE) detection in Dutch clinical free text is lacking. Method: We set a benchmark for ADE detection in Dutch clinical documents, evaluating both a two-step task (relation classification on gold-standard entities) and an end-to-end task (relation classification on predicted entities). We compare a Bi-LSTM and four transformer models (BERTje, RobBERT, MedRoBERTa.nl, and NuNER) using fit-for-purpose performance measures, internal validation, and external document-level validation. Contribution/Results: Differences between the models were small, but MedRoBERTa.nl performed best, with a macro-averaged F1 score of 0.63 on gold-standard entities and 0.62 on predicted entities; in external document-level validation it reached a recall of 0.67 to 0.74, meaning 67% to 74% of discharge letters with ADEs were detected. This work offers a robust, clinically meaningful approach for evaluating language models for ADE detection in Dutch clinical free text.

📝 Abstract

In this study, we set a benchmark for adverse drug event (ADE) detection in Dutch clinical free text documents using several transformer models, clinical scenarios and fit-for-purpose performance measures. We trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model and four transformer-based Dutch and/or multilingual encoder models (BERTje, RobBERT, MedRoBERTa.nl, and NuNER) for the tasks of named entity recognition (NER) and relation classification (RC) using 102 richly annotated Dutch ICU clinical progress notes. Anonymized free text clinical progress notes of patients admitted to the intensive care unit (ICU) of one academic hospital and discharge letters of patients admitted to Internal Medicine wards of two non-academic hospitals were reused. We evaluated our ADE RC models internally using gold standard entities (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated on detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores, given the imbalance of ADEs in the datasets. Although differences between the models on the ADE RC task were small, MedRoBERTa.nl was the best performing model, with a macro-averaged F1 score of 0.63 using gold standard entities and 0.62 using predicted entities. The MedRoBERTa.nl models also performed best in our external validation and achieved a recall of between 0.67 and 0.74 using predicted entities, meaning that between 67% and 74% of discharge letters with ADEs were detected. Our benchmark study presents a robust and clinically meaningful approach for evaluating language models for ADE detection in clinical free text documents. Our study highlights the need to use performance measures that fit both the task of ADE detection in clinical free text documents and the envisioned future clinical use.
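The abstract reports both micro- and macro-averaged F1 because ADEs are rare in the data. The following is a minimal sketch of why the two averages can diverge under class imbalance. It is not the authors' evaluation code: the function name `f1_scores`, the label names, and the toy data are all illustrative assumptions.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length lists of class labels."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Made-up toy data (not from the study): 9 negatives and 1 ADE relation.
gold = ["no_relation"] * 9 + ["ADE"]
pred = ["no_relation"] * 10  # hypothetical model that never predicts an ADE

micro, macro = f1_scores(gold, pred)
print(f"micro-F1 = {micro:.2f}, macro-F1 = {macro:.2f}")
```

In this toy case the micro average stays high even though every ADE is missed, while the macro average drops sharply, which is one reason a macro-averaged score is informative when the class of interest is rare.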

Problem

Research questions and friction points this paper is trying to address.

Benchmarking ADE detection in Dutch clinical texts using transformers.

Evaluating NER and RC models for ADE identification in ICU notes.

Assessing model performance with clinical measures and external validation.
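Document-level external validation can be pictured with a short sketch. The abstract does not state how relation-level predictions were aggregated per document, so this example assumes a discharge letter counts as detected when at least one ADE relation is predicted in it. The function name `document_level_recall`, the document identifiers, and the data are hypothetical.

```python
def document_level_recall(gold_docs, predicted_counts):
    """Share of documents containing a gold ADE in which at least one ADE was predicted.

    gold_docs: dict mapping document id -> True if the document contains an ADE.
    predicted_counts: dict mapping document id -> number of predicted ADE relations.
    """
    positives = [doc_id for doc_id, has_ade in gold_docs.items() if has_ade]
    if not positives:
        return 0.0
    # Assumption: one or more predicted ADE relations flags the whole document.
    detected = sum(1 for doc_id in positives if predicted_counts.get(doc_id, 0) > 0)
    return detected / len(positives)


# Made-up example: three letters contain an ADE, and two of them are flagged.
gold_docs = {"letter_1": True, "letter_2": True, "letter_3": False, "letter_4": True}
predicted_counts = {"letter_1": 2, "letter_2": 0, "letter_3": 1, "letter_4": 1}

recall = document_level_recall(gold_docs, predicted_counts)
print(f"document-level recall = {recall:.2f}")
```

Under this assumption, a recall of 0.67 to 0.74 corresponds to 67% to 74% of the letters that contain an ADE being flagged, which matches how the abstract interprets its external validation result.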

Innovation

Methods, ideas, or system contributions that make the work stand out.

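Compared a Bi-LSTM with four transformer encoder models (BERTje, RobBERT, MedRoBERTa.nl, and NuNER) for ADE detection.

Set a benchmark for Dutch clinical free text with internal validation and external document-level validation.

Identified MedRoBERTa.nl as the best performing model, with a macro-averaged F1 score of 0.63 using gold standard entities.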
