MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition

📅 2025-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional biomedical concept recognition relies on identifying explicit mentions, so it misses concepts that are only implied in the text, which hinders ontology refinement and knowledge graph construction. To address this, we propose an ontology-driven index-then-recognize paradigm that decouples concept recognition into semantic indexing and end-to-end recognition. We introduce the semantic search index (ssID) to resolve ambiguities among ontology entries, use LLM-generated queries and synthetic data to improve recognition in low-resource settings, and integrate BART fine-tuning, ontology alignment, and vectorized retrieval. Evaluated on three scenarios (CDR, HPO, and HOIP), our method proves effective at recognizing both explicit and implicit concepts while reducing computational overhead and lowering the barrier to deployment by domain experts. Our core contribution is a unified, ontology-guided concept recognition framework that needs no mention-level annotations at inference time.

📝 Abstract

Recognizing biomedical concepts in text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods rely on explicit mention identification and often fail to capture complex concepts that are not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Because it uses a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements and so facilitates adoption by domain experts. Furthermore, we incorporate queries generated by large language models (LLMs) and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical applications. Our code and constructed data are available at https://github.com/sl-633/macoir-master.

Problem

Research questions and friction points this paper is trying to address.

Recognizing implicit biomedical concepts not explicitly stated in text

Reducing computational needs for ontology-driven concept recognition

Improving concept recognition in low-resource biomedical settings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic search indexes for concept disambiguation

BART-based model fine-tuned on small datasets

LLM-generated queries for low-resource settings
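
The indexing-recognition idea can be illustrated with a toy sketch. This page does not describe how ssIDs are actually constructed, so everything below is an assumption made for illustration: the hand-written 2-D "embeddings", the recursive median split used to derive IDs, the `;` separator, and the names `assign_ssids` and `decode` are all hypothetical, and the fine-tuned BART model is stood in for by a hard-coded output string. It only shows the general shape of the approach: concepts close in embedding space receive ssIDs with a shared prefix, and generated ssIDs are mapped back to ontology concepts.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def assign_ssids(concepts: Dict[str, Vector], prefix: str = "") -> Dict[str, str]:
    """Give each concept a digit-string ID by recursively halving the set.

    Assumption: a median split along the widest axis stands in for whatever
    clustering the real system uses. Nearby concepts share an ID prefix.
    """
    names = sorted(concepts)
    if not names:
        return {}
    if len(names) == 1:
        return {names[0]: prefix or "0"}

    def spread(axis: int) -> float:
        values = [concepts[name][axis] for name in names]
        return max(values) - min(values)

    # Split along the axis on which the concepts differ the most.
    axis = max(range(len(concepts[names[0]])), key=spread)
    ordered = sorted(names, key=lambda name: (concepts[name][axis], name))
    mid = len(ordered) // 2

    ssids: Dict[str, str] = {}
    for digit, group in (("0", ordered[:mid]), ("1", ordered[mid:])):
        subset = {name: concepts[name] for name in group}
        ssids.update(assign_ssids(subset, prefix + digit))
    return ssids


def decode(generated: str, ssids: Dict[str, str]) -> List[str]:
    """Map a ';'-separated string of generated ssIDs back to concept names.

    IDs that are not in the index are dropped, and duplicates are removed
    while keeping first-seen order.
    """
    by_ssid = {ssid: name for name, ssid in ssids.items()}
    found: List[str] = []
    for token in generated.split(";"):
        name = by_ssid.get(token.strip())
        if name is not None and name not in found:
            found.append(name)
    return found


# Hypothetical toy ontology with made-up 2-D embeddings.
TOY_CONCEPTS: Dict[str, Vector] = {
    "hypertension": (0.9, 0.1),
    "hypotension": (0.8, 0.2),
    "seizure": (0.1, 0.9),
    "epilepsy": (0.2, 0.8),
}

if __name__ == "__main__":
    index = assign_ssids(TOY_CONCEPTS)
    print(index)
    # Stand-in for model output: one valid ssID and one invalid token.
    fake_generation = f"{index['epilepsy']}; not-an-id"
    print(decode(fake_generation, index))
```

In this sketch, recognition never points at a span of the input text: the model only has to emit valid IDs, which is why no mention-level annotation is needed at inference time. Filtering generated IDs against the index is one simple way to discard outputs that do not correspond to any ontology entry.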

🔎 Similar Papers

A Simplified Retriever to Improve Accuracy of Phenotype Normalizations by Large Language Models