🤖 AI Summary
This work investigates the feasibility and key challenges of deploying large language models (LLMs) as substitutes for human annotators in biomedical text mining, focusing on three core difficulties: implicit domain adaptation, discriminative task-format constraints, and strict adherence to annotation guidelines. To address these, we propose a guideline-aware dynamic instruction extraction pipeline that integrates in-context learning prompt engineering, structured parsing of annotation guidelines, and LLM-to-BERT knowledge distillation, enabling zero-shot and few-shot biomedical entity and relation extraction. Experiments demonstrate that frontier LLMs match or surpass state-of-the-art (SOTA) BERT-based models across multiple tasks; moreover, the distilled BERT models achieve practical accuracy using only synthetic annotations generated by LLMs. This study is the first to empirically validate the viability and effectiveness of LLM-assisted annotation without model fine-tuning and with minimal human annotation effort, establishing a novel paradigm for biomedical NLP annotation.
📝 Abstract
Large language models (LLMs) can perform various natural language processing (NLP) tasks through in-context learning without relying on supervised data. However, multiple previous studies have reported suboptimal performance of LLMs in biomedical text mining. By analyzing failure patterns in these evaluations, we identified three primary challenges for LLMs in biomedical corpora: (1) LLMs fail to learn implicit dataset-specific nuances from supervised data; (2) the common formatting requirements of discriminative tasks limit the reasoning capabilities of LLMs, particularly for models that lack test-time compute; and (3) LLMs struggle to adhere to annotation guidelines and match exact schemas, limiting their ability to satisfy the detailed annotation requirements that are essential in biomedical annotation workflows. To address these challenges, we experimented with prompt engineering techniques targeted at these issues and developed a pipeline that dynamically extracts instructions from annotation guidelines. Our findings show that frontier LLMs can approach or surpass the performance of state-of-the-art (SOTA) BERT-based models with minimal reliance on manually annotated data and without fine-tuning. Furthermore, we performed model distillation on a closed-source LLM, demonstrating that a BERT model trained exclusively on synthetic data annotated by LLMs can also achieve practical performance. Based on these results, we explored the feasibility of partially replacing manual annotation with LLMs in production scenarios for biomedical text mining.
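The abstract describes the pipeline only at a high level. As a minimal sketch of two of its stages — assembling a guideline-aware in-context prompt, and converting LLM-annotated spans into token-level BIO tags usable as synthetic supervision for a distilled BERT tagger — the following is illustrative only: the function names, prompt layout, and entity-span format are assumptions, not the paper's actual implementation.

```python
import json

def build_prompt(guideline_rules, few_shot_examples, passage):
    """Assemble a guideline-aware NER prompt (hypothetical format).

    guideline_rules: instruction strings extracted from the annotation
    guideline; few_shot_examples: (text, entities) pairs for in-context
    learning; passage: the text to annotate.
    """
    lines = ["Annotate entity mentions. Follow these guideline rules:"]
    lines += [f"- {rule}" for rule in guideline_rules]
    for text, entities in few_shot_examples:
        lines.append(f"Text: {text}")
        lines.append(f"Entities: {json.dumps(entities)}")
    lines.append(f"Text: {passage}")
    lines.append("Entities:")
    return "\n".join(lines)

def to_bio_tags(tokens, entities):
    """Convert LLM-annotated character spans into token-level BIO tags,
    the format a BERT tagger would be trained on during distillation."""
    tags = ["O"] * len(tokens)
    # Naive character offsets, assuming single-space tokenization.
    offsets, pos = [], 0
    for tok in tokens:
        offsets.append((pos, pos + len(tok)))
        pos += len(tok) + 1
    for ent in entities:
        first = True
        for i, (start, end) in enumerate(offsets):
            if start >= ent["start"] and end <= ent["end"]:
                tags[i] = ("B-" if first else "I-") + ent["type"]
                first = False
    return tags
```

In a full pipeline, the prompt would be sent to an LLM, its structured output parsed into entity spans, and the resulting BIO-tagged corpus used to fine-tune a BERT model in place of human annotations.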