🤖 AI Summary
To address the high cost of manually constructing high-quality annotation guidelines, and their strong coupling to specific tasks in clinical information extraction, this paper proposes a zero-shot method for self-generating annotation guidelines. Leveraging the knowledge-summarization and text-generation capacities of large language models (LLMs), the authors design a self-improving framework that automatically synthesizes reusable annotation guidelines while requiring virtually no domain expertise or manual authoring. The generated guidelines are plug-and-play across multiple clinical named entity recognition (NER) tasks. Evaluated on four clinical NER benchmarks, the method improves strict F1 scores by 0.20 to 25.86 percentage points over a no-guideline baseline; moreover, the automatically generated guidelines perform on par with or better than human-written ones, exceeding them by 1.15 to 4.14 points in most tasks. This approach improves both the efficiency of guideline construction and generalizability across clinical NER tasks.
📝 Abstract
Generative information extraction with large language models, particularly through few-shot learning, has become a popular approach. Recent studies indicate that providing a detailed, human-readable guideline, similar to the annotation guidelines traditionally used to train human annotators, can significantly improve performance. However, constructing these guidelines is both labor- and knowledge-intensive. Moreover, entity definitions are often tailored to specific needs, making guidelines highly task-specific and rarely reusable, and handling these subtle differences requires considerable effort and attention to detail. In this study, we propose a self-improving method that harnesses the knowledge-summarization and text-generation capacities of LLMs to synthesize annotation guidelines while requiring virtually no human input. In zero-shot experiments on four clinical named entity recognition benchmarks (2012 i2b2 EVENT, 2012 i2b2 TIMEX, 2014 i2b2, and 2018 n2c2), our method improved strict F1 scores by 25.86%, 4.36%, 0.20%, and 7.75%, respectively, over the no-guideline baseline. The LLM-synthesized guidelines performed on par with or better than human-written guidelines, exceeding them by 1.15% to 4.14% in most tasks. In conclusion, this study proposes a novel LLM self-improvement method that requires minimal knowledge and human input and is applicable across multiple biomedical domains.
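The self-improving loop the abstract describes (an LLM drafting a guideline, applying it zero-shot, then revising the guideline from its own outputs) might be sketched roughly as follows. This is a hedged illustration, not the paper's implementation: `call_llm`, the prompt wording, and the number of refinement rounds are all hypothetical placeholders, and the stub client would be replaced by a real LLM API call.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real API call.

    Here it is a stub that returns a trivial guideline so the
    loop structure can run end to end for demonstration.
    """
    return "Guideline: annotate contiguous spans denoting the target clinical concept."

def synthesize_guideline(entity_type: str, sentences: list[str], rounds: int = 3) -> str:
    """Zero-shot, self-improving guideline synthesis (sketch).

    1. Ask the LLM to draft an annotation guideline for the entity type.
    2. Apply the guideline to unlabeled sentences (zero-shot NER).
    3. Feed the guideline plus its own outputs back for refinement.
    """
    # Step 1: initial draft, written as if training human annotators.
    guideline = call_llm(
        f"Write an annotation guideline for the clinical entity type "
        f"'{entity_type}', as if training human annotators."
    )
    for _ in range(rounds):
        # Step 2: zero-shot extraction guided only by the current guideline.
        predictions = [
            call_llm(
                f"Guideline:\n{guideline}\n\n"
                f"Extract {entity_type} entities from: {sentence}"
            )
            for sentence in sentences
        ]
        # Step 3: let the model revise its own guideline in light of
        # ambiguities or errors visible in the predictions.
        guideline = call_llm(
            f"Current guideline:\n{guideline}\n\n"
            f"Model outputs:\n{predictions}\n\n"
            "Revise the guideline to resolve ambiguities these outputs reveal."
        )
    return guideline
```

The resulting guideline string would then be prepended to the NER prompt at inference time, which is what makes it plug-and-play across tasks: only `entity_type` and the development sentences change.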