🤖 AI Summary
Traditional biomedical concept identification relies on explicit mentions, which limits its ability to capture implicit concepts and hinders ontology refinement and knowledge graph construction. To address this, we propose an ontology-driven index-then-identify paradigm that decouples concept identification into semantic indexing and end-to-end recognition. We introduce the semantic search index (ssID) to resolve ontology ambiguity, leverage LLM-generated queries and synthetic data for mention-free training, and integrate BART fine-tuning, ontology alignment, and vectorized retrieval. Evaluated on the CDR, HPO, and HOIP benchmarks, our method improves both explicit and implicit concept identification while reducing computational overhead and enabling low-barrier deployment by domain experts. Our core contribution is the first unified, fully ontology-guided concept identification framework that operates without mention-level annotations.
📝 Abstract
Recognizing biomedical concepts in text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, which rely on explicit mention identification, often fail to capture complex concepts that are not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Because it uses a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements and thereby facilitates adoption by domain experts. Furthermore, we incorporate large language model (LLM)-generated queries and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical applications. Our code and constructed data are available at https://github.com/sl-633/macoir-master.
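To make the indexing-recognition split more concrete, here is a minimal sketch of the recognition side only: a generated string of ssIDs is mapped back to ontology concepts through a lookup index. This is an illustration under stated assumptions, not the MA-COIR implementation (see the linked repository for that). The ssID values, the concept identifiers, the `-` and `;` separators, and the names `SSID_TO_CONCEPT` and `decode_ssids` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy index: ssID -> ontology concept ID.
# Real ssIDs would be assigned by the indexing step; these values are made up.
SSID_TO_CONCEPT: Dict[str, str] = {
    "0-1-3": "ONT:0001",
    "0-1-4": "ONT:0002",
    "2-0-1": "ONT:0003",
}


def decode_ssids(generated: str, index: Dict[str, str], sep: str = ";") -> List[str]:
    """Map a generated ssID string to concept IDs.

    ssIDs that are not in the index are dropped, and repeated concepts
    are kept only once, in order of first appearance.
    """
    concepts: List[str] = []
    for raw in generated.split(sep):
        ssid = raw.strip()
        concept = index.get(ssid)
        if concept is not None and concept not in concepts:
            concepts.append(concept)
    return concepts


# Stand-in for the output of a fine-tuned sequence-to-sequence model
# (assumed format: ssIDs separated by ";").
model_output = "0-1-3; 9-9-9; 2-0-1; 0-1-3"
recognized = decode_ssids(model_output, SSID_TO_CONCEPT)
```

In this sketch the model never has to point at a mention span in the input text; it only has to emit identifiers that exist in the index, which is why concepts that are implied rather than stated can still be returned.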