🤖 AI Summary
This study addresses the poor performance of general-purpose information extraction models in the autoimmune domain, primarily due to the scarcity of domain-adapted annotated data. To bridge this gap, we introduce AAbAAC, the first structured corpus specifically tailored for autoimmune research, comprising expert-annotated disease and antibody entities along with their relationships from 115 PubMed abstracts. Leveraging this resource, we fine-tune and evaluate state-of-the-art named entity recognition models, demonstrating that even a small-scale, high-quality domain-specific dataset can substantially enhance model performance in specialized biomedical contexts. Our results validate the practical utility and domain adaptability of AAbAAC, highlighting its potential to support downstream applications in autoimmune literature mining and knowledge discovery.
📝 Abstract
Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, in which we manually annotated entities and their relationships. We used AAbAAC first to evaluate several methods on the task of named entity recognition (NER), and then to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing the expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.
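As an illustration of how NER output can be scored against a manually annotated corpus such as this one, the following minimal sketch computes entity-level precision, recall, and F1 under exact span-and-label matching. The abstract does not state which metric or matching scheme was used, so the exact-match criterion, the function name `entity_prf`, the character offsets, and the label names (`DISEASE`, `AUTOANTIBODY`, `TARGET`) are all assumptions made for illustration, not the corpus's actual schema.

```python
from typing import Iterable, Set, Tuple

# An entity is (start_char, end_char, label).
# Offsets and label names here are hypothetical, not taken from AAbAAC.
Span = Tuple[int, int, str]


def entity_prf(gold: Iterable[Span], pred: Iterable[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 with exact span and label match."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(pred)
    # A prediction counts as correct only if boundaries and label both match.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy annotations for a single abstract (invented for this example).
gold = [(0, 26, "DISEASE"), (40, 55, "AUTOANTIBODY"), (60, 70, "TARGET")]
pred = [(0, 26, "DISEASE"), (40, 55, "TARGET"), (80, 90, "DISEASE")]

p, r, f = entity_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case only one of three predictions matches a gold entity exactly, since the second prediction has the right boundaries but the wrong label. Comparing such scores before and after fine-tuning on the same held-out abstracts is one straightforward way to quantify the kind of improvement the abstract describes.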