Low-resource Information Extraction with the European Clinical Case Corpus

📅 2025-03-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the scarcity of annotated data for extracting diseases and test-result relations from clinical texts in low-resource languages, this paper introduces E3C-3.0, a multilingual clinical dataset that combines native texts in five languages with texts translated and projected from English into five target languages. The dataset is built with a semi-automatic pipeline: LLM-based annotation projection followed by human revision. Methodologically, the work combines cross-lingual LLM annotation projection, multilingual fine-tuning, and cross-lingual transfer learning, improving mainstream LLMs' relation extraction performance on low-resource languages (average F1 gain of 12.7%). Empirical evaluation compares performance on native and projected data and indicates that projected annotations reach quality comparable to native human annotations, while cross-lingual transfer generalizes well across languages. All data, annotation protocols, and trained models are publicly released, establishing a reproducible benchmark and practical toolkit for low-resource clinical information extraction.
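
To make the reported F1 figures concrete, the sketch below computes micro-averaged F1 over exact-match relation tuples. This is an illustration only: the summary does not state the paper's exact scoring protocol, and the `Relation` tuple layout (document id, disease mention, test-result mention) is a hypothetical choice made for the example.

```python
from typing import Iterable, Tuple

# Hypothetical layout: (doc_id, disease mention, test-result mention).
Relation = Tuple[str, str, str]


def micro_f1(gold: Iterable[Relation], predicted: Iterable[Relation]) -> float:
    """Micro-averaged F1 where a prediction counts only on exact tuple match."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```

Comparing this score for a base model and a fine-tuned model on the same test split would give the kind of F1 difference the summary refers to.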

📝 Abstract

We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). A semi-automatic approach has been implemented, including automatic annotation projection based on Large Language Models (LLMs) and human revision. We present several experiments showing that current state-of-the-art LLMs can benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning in different languages is very effective, mitigating the scarcity of data. Finally, we compare performance both on native data and on projected data. We release the data at https://huggingface.co/collections/NLP-FBK/e3c-projected-676a7d6221608d60e4e9fd89 .
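
One way to picture the hand-off between automatic projection and human revision is sketched below. It assumes, hypothetically, that the projection step returns (label, mention string) pairs for the translated text; the code anchors each pair to character offsets and flags anything it cannot find for a human reviewer. The label names, the `ProjectedMention` class, and both function names are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ProjectedMention:
    label: str                       # e.g. "DISEASE" or "TEST" (assumed label names)
    text: str                        # mention string proposed for the target language
    span: Optional[Tuple[int, int]]  # character offsets in the target text, None if not found


def locate_mentions(target_text: str,
                    proposals: List[Tuple[str, str]]) -> List[ProjectedMention]:
    """Anchor each (label, mention) proposal to offsets in target_text.

    Matching is exact and case-sensitive. Repeated mentions are matched left
    to right, so two occurrences of the same string get distinct spans.
    """
    results: List[ProjectedMention] = []
    resume_at: Dict[str, int] = {}  # mention string -> index to resume searching from
    for label, mention in proposals:
        start = target_text.find(mention, resume_at.get(mention, 0))
        if start == -1:
            results.append(ProjectedMention(label, mention, None))
            continue
        end = start + len(mention)
        resume_at[mention] = end
        results.append(ProjectedMention(label, mention, (start, end)))
    return results


def needs_human_review(mentions: List[ProjectedMention]) -> List[ProjectedMention]:
    """Return the mentions that could not be anchored in the target text."""
    return [m for m in mentions if m.span is None]
```

Exact string matching is deliberately strict here: a mention that was paraphrased or inflected differently in translation fails to anchor and ends up in the review queue rather than being silently misaligned.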

Problem

Research questions and friction points this paper is trying to address.

Addresses low-resource information extraction in clinical texts

Enhances multilingual medical data via annotation projection

Improves LLM performance through fine-tuning and transfer learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual clinical dataset annotated with diseases and test-result relations

Semi-automatic annotation using LLMs and human revision

Cross-lingual transfer learning effectively mitigates data scarcity
