🤖 AI Summary
Biomedical named entity recognition (NER) faces challenges including open-ended entity type sets, scarce annotated data, and the continual emergence of novel concepts, which limit conventional methods that rely on fixed label inventories and strong supervision. This paper introduces GLiNER-biomed, a suite of GLiNER models adapted for zero- and few-shot biomedical NER. The approach distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model to generate high-coverage synthetic biomedical training data, then trains uni- and bi-encoder GLiNER architectures at multiple scales, using natural-language entity descriptions to drive generalization to unseen entity types. Evaluated on several biomedical benchmarks, GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero- and few-shot settings, with a 5.96% F1-score improvement over the strongest baseline. All models, synthetic datasets, and source code are publicly released.
📝 Abstract
Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types or to adapt efficiently to emerging concepts. To address these issues, we introduce GLiNER-biomed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models tailored for biomedical NER. In contrast to conventional approaches, GLiNER uses natural language descriptions to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Evaluations on several biomedical datasets demonstrate that GLiNER-biomed outperforms state-of-the-art GLiNER models in both zero- and few-shot scenarios, achieving a 5.96% improvement in F1-score over the strongest baseline. Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on high-quality general-domain annotations. All datasets, models, and training pipelines are publicly available at https://github.com/ds4dh/GLiNER-biomed.
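The bi-encoder variant described above recognizes arbitrary entity types by scoring candidate text spans against natural-language type descriptions in a shared embedding space. The following toy sketch illustrates that matching principle only; the vectors, entity names, and threshold are made up for illustration and are not the paper's implementation, which learns the span and label encoders jointly.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Toy "embeddings": in a bi-encoder these would come from separate
# span and label encoders sharing one vector space.
span_embeddings = {
    "imatinib": [0.9, 0.1, 0.0],
    "chronic myeloid leukemia": [0.1, 0.9, 0.1],
}
label_embeddings = {
    "drug": [1.0, 0.0, 0.0],
    "disease": [0.0, 1.0, 0.0],
}

def predict(spans, labels, threshold=0.5):
    """Assign each span its best-matching label when the score clears the threshold."""
    results = {}
    for span, s_vec in spans.items():
        best_label, best_score = max(
            ((label, cosine(s_vec, l_vec)) for label, l_vec in labels.items()),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            results[span] = best_label
    return results

print(predict(span_embeddings, label_embeddings))
# → {'imatinib': 'drug', 'chronic myeloid leukemia': 'disease'}
```

Because the labels are free text rather than a fixed inventory, adding a new entity type at inference time only requires embedding its description, which is what enables zero-shot recognition.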