🤖 AI Summary
This work investigates the feasibility and key challenges of deploying large language models (LLMs) as substitutes for human annotators in biomedical text mining, focusing on three core difficulties: implicit domain adaptation, discriminative task-format constraints, and strict adherence to annotation guidelines. To address these, we propose a guideline-aware dynamic instruction extraction pipeline that integrates in-context learning prompt engineering, structured parsing of annotation guidelines, and LLM-to-BERT knowledge distillation, enabling zero-shot and few-shot biomedical entity and relation extraction. Experiments demonstrate that frontier LLMs match or surpass state-of-the-art (SOTA) BERT-based models across multiple tasks; moreover, the distilled BERT models achieve practical accuracy using only synthetic annotations generated by LLMs. This study is the first to empirically validate the viability and effectiveness of LLM-assisted annotation without model fine-tuning and with minimal human annotation effort, establishing a novel paradigm for biomedical NLP annotation.
📝 Abstract
Large language models (LLMs) can perform various natural language processing (NLP) tasks through in-context learning without relying on supervised data. However, multiple previous studies have reported suboptimal performance of LLMs in biomedical text mining. By analyzing failure patterns in these evaluations, we identified three primary challenges for LLMs in biomedical corpora: (1) LLMs fail to learn implicit dataset-specific nuances from supervised data; (2) the common formatting requirements of discriminative tasks limit the reasoning capabilities of LLMs, particularly for models that lack test-time compute; and (3) LLMs struggle to adhere to annotation guidelines and match exact schemas, limiting their ability to satisfy the detailed annotation requirements that are essential in biomedical annotation workflows. To address these challenges, we experimented with prompt engineering techniques targeted at these issues and developed a pipeline that dynamically extracts instructions from annotation guidelines. Our findings show that frontier LLMs can approach or surpass the performance of state-of-the-art (SOTA) BERT-based models with minimal reliance on manually annotated data and without fine-tuning. Furthermore, we performed model distillation on a closed-source LLM, demonstrating that a BERT model trained exclusively on synthetic data annotated by LLMs can also achieve practical performance. Based on these results, we explored the feasibility of partially replacing manual annotation with LLMs in production scenarios for biomedical text mining.
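The abstract describes the pipeline only at a high level. As a minimal sketch of two of its stages — assembling a guideline-aware in-context prompt, and converting LLM-annotated spans into token-level BIO tags usable as synthetic supervision for a distilled BERT tagger — the following is illustrative only: the function names, prompt layout, and entity-span format are assumptions, not the paper's actual implementation.

```python
import json

def build_prompt(guideline_rules, few_shot_examples, passage):
    """Assemble a guideline-aware NER prompt (hypothetical format).

    guideline_rules: instruction strings extracted from the annotation
    guideline; few_shot_examples: (text, entities) pairs for in-context
    learning; passage: the text to annotate.
    """
    lines = ["Annotate entity mentions. Follow these guideline rules:"]
    lines += [f"- {rule}" for rule in guideline_rules]
    for text, entities in few_shot_examples:
        lines.append(f"Text: {text}")
        lines.append(f"Entities: {json.dumps(entities)}")
    lines.append(f"Text: {passage}")
    lines.append("Entities:")
    return "\n".join(lines)

def to_bio_tags(tokens, entities):
    """Convert LLM-annotated character spans into token-level BIO tags,
    the format a BERT tagger would be trained on during distillation."""
    tags = ["O"] * len(tokens)
    # Naive character offsets, assuming single-space tokenization.
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1
    for ent in entities:
        first = True
        for i, (start, end) in enumerate(offsets):
            if start >= ent["start"] and end <= ent["end"]:
                tags[i] = ("B-" if first else "I-") + ent["type"]
                first = False
    return tags
```

In a full pipeline, the prompt would be sent to an LLM, its structured output parsed into entity spans, and the resulting BIO-tagged corpus used to fine-tune a BERT model in place of human annotations.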