🤖 AI Summary
This paper addresses key challenges in sentence-level knowledge triplet extraction from complex sentences: coreference ambiguity, syntactic nesting, and rare-relation identification. To this end, the authors propose CoDe-KG, the first open-source, end-to-end framework for sentence-level knowledge graph construction. Its core innovation is jointly modeling coreference resolution and dependency-driven syntactic decomposition, augmented by a hybrid chain-of-thought prompting strategy combined with few-shot learning, all optimized on a human-annotated, complexity-stratified corpus. Experiments demonstrate state-of-the-art performance: CoDe-KG achieves 65.8% macro-F1 on REBEL and 75.7% micro-F1 on WebNLG2, substantially outperforming prior methods, and improves recall on rare relations by over 20%. The authors also publicly release a high-quality benchmark dataset of over 150,000 triplets to support future research.
📝 Abstract
We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our pipeline, we contribute an open-source dataset of over 150,000 knowledge triples. We also contribute a 7,248-row training corpus for sentence complexity, 190 rows of gold human annotations for coreference resolution drawn from open-access lung-cancer abstracts on PubMed, 900 rows of gold human annotations for sentence-conversion policies, and 398 gold human-annotated triples. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies show that integrating coreference resolution and decomposition increases recall on rare relations by over 20%. Code and dataset are available at https://github.com/KaushikMahmud/CoDe-KG_EMNLP_2025
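To make the pipeline's shape concrete, here is a minimal illustrative sketch of its three stages: coreference resolution, sentence decomposition, and triple extraction. Every function below is a toy rule-based stand-in (regex substitutions and a tiny relation list), not the paper's actual neural resolver or prompted LLM; the example sentence and relation names are invented for illustration.

```python
import re

def resolve_coreference(sentence: str, antecedent: str) -> str:
    # Toy stand-in for the coreference-resolution stage:
    # replace third-person pronouns with a known antecedent.
    return re.sub(r"\b(it|he|she|they)\b", antecedent, sentence, flags=re.IGNORECASE)

def decompose(sentence: str) -> list[str]:
    # Toy stand-in for dependency-driven syntactic decomposition:
    # split a compound sentence on a coordinating ", and".
    return [c.strip().rstrip(".") for c in re.split(r",\s*and\s+", sentence)]

def extract_triple(clause: str):
    # Toy stand-in for the prompted relation-extraction stage:
    # match a (subject, relation, object) pattern over a tiny relation list.
    m = re.match(r"(.+?)\s+(treats|causes|is located in)\s+(.+)", clause)
    return (m.group(1), m.group(2), m.group(3)) if m else None

# Illustrative input: one complex sentence with a pronoun and a conjunction.
sentence = "Gefitinib treats lung cancer, and it causes skin rash."
resolved = resolve_coreference(sentence, "Gefitinib")
triples = [t for c in decompose(resolved) if (t := extract_triple(c))]
print(triples)
```

The point of the sketch is the ordering: resolving "it" before decomposition lets both simple clauses yield a triple with the correct subject, which is the recall gain on rare relations the ablation attributes to combining the two stages.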