🤖 AI Summary
This work addresses the scarcity of expert-annotated data in supervised biomedical entity linking by proposing a context-augmented framework based on large language models (LLMs). The approach uses LLMs to automatically generate context-rich synthetic training examples for candidate concepts in a knowledge base, providing broad supervision without manual annotation; these examples are paired with a decoder-only architecture and guided inference. The study further introduces an LLM-as-a-judge protocol to assess clinical validity more accurately than exact code matching. Evaluated on three multilingual benchmarks—MedMentions, QUAERO, and SPACCC—the method achieves new state-of-the-art performance, matches fully supervised results with only 40% of the annotated data, and significantly improves the rate of clinically valid predictions.
📝 Abstract
We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. In data-efficiency experiments, SynCABEL matches the performance of full human supervision with up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, because standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol; this analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.
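To make the synthetic-augmentation idea concrete, the sketch below illustrates one plausible pipeline shape: for each concept in a knowledge base, prompt an LLM to produce a context-rich sentence mentioning that concept, then parse the result into a (mention, context, concept ID) training triple. The prompt wording, the `[E]…[/E]` tagging convention, and the function names are illustrative assumptions for this sketch, not the paper's actual implementation; a mock generator stands in for a real LLM call.

```python
import re

def build_prompt(concept_name: str, concept_id: str) -> str:
    """Compose a generation prompt for one knowledge-base concept.

    The wording and the [E]...[/E] tagging scheme are assumptions made
    for this sketch, not SynCABEL's actual prompt.
    """
    return (
        f"Write one clinical sentence that naturally mentions the "
        f"biomedical concept '{concept_name}' (code {concept_id}). "
        f"Wrap the mention in [E]...[/E] tags."
    )

def parse_example(generation: str, concept_id: str):
    """Extract a (mention, context, concept_id) triple from tagged text."""
    m = re.search(r"\[E\](.+?)\[/E\]", generation)
    if m is None:
        return None  # discard generations that ignore the tag format
    return {
        "mention": m.group(1),
        "context": re.sub(r"\[/?E\]", "", generation),  # strip the tags
        "concept_id": concept_id,
    }

def synthesize(concepts: dict, llm) -> list:
    """Build synthetic training triples for every concept in `concepts`.

    `concepts` maps concept IDs to preferred names; `llm` is any
    callable that maps a prompt string to generated text.
    """
    examples = []
    for cid, name in concepts.items():
        ex = parse_example(llm(build_prompt(name, cid)), cid)
        if ex is not None:
            examples.append(ex)
    return examples

def mock_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call, returning a canned sentence."""
    return ("The patient was diagnosed with "
            "[E]type 2 diabetes mellitus[/E] two years ago.")

data = synthesize({"C0011860": "type 2 diabetes mellitus"}, mock_llm)
```

Each resulting triple can then serve directly as a supervised training example, which is how broad coverage of the knowledge base is obtained without manual annotation.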