🤖 AI Summary
This paper addresses key challenges in sentence-level knowledge triplet extraction from complex sentences: coreference ambiguity, syntactic nesting, and rare-relation identification. To this end, the authors propose CoDe-KG, the first open-source, end-to-end framework for sentence-level knowledge graph construction. Its core innovation is jointly modeling coreference resolution and dependency-driven syntactic decomposition, augmented by a hybrid chain-of-thought prompting strategy combined with few-shot learning, all optimized on a human-annotated, complexity-stratified corpus. Experiments demonstrate state-of-the-art performance: CoDe-KG achieves 65.8% macro-F1 on REBEL and 75.7% micro-F1 on WebNLG2, substantially outperforming prior methods, and improves recall on rare relations by over 20%. The authors also publicly release a high-quality benchmark dataset of over 150,000 triplets to support future research.
📝 Abstract
We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our pipeline, we contribute an open-source dataset of over 150,000 knowledge triples. We also contribute a 7,248-row training corpus for sentence complexity, 190 rows of gold human annotations for coreference resolution drawn from open-access lung-cancer abstracts on PubMed, 900 rows of gold human annotations for sentence-conversion policies, and 398 gold human-annotated triples. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies show that integrating coreference resolution and decomposition increases recall on rare relations by over 20%. Code and dataset are available at https://github.com/KaushikMahmud/CoDe-KG_EMNLP_2025
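To make the pipeline's shape concrete, here is a minimal illustrative sketch of its three stages: coreference resolution, sentence decomposition, and triple extraction. Every function below is a toy rule-based stand-in (regex substitutions and a tiny relation list), not the paper's actual neural resolver or prompted LLM; the example sentence and relation names are invented for illustration.

```python
import re

def resolve_coreference(sentence: str, antecedent: str) -> str:
    # Toy stand-in for the coreference-resolution stage:
    # replace third-person pronouns with a known antecedent.
    return re.sub(r"\b(it|he|she|they)\b", antecedent, sentence, flags=re.IGNORECASE)

def decompose(sentence: str) -> list[str]:
    # Toy stand-in for dependency-driven syntactic decomposition:
    # split a compound sentence on a coordinating ", and".
    return [c.strip().rstrip(".") for c in re.split(r",\s*and\s+", sentence)]

def extract_triple(clause: str):
    # Toy stand-in for the prompted relation-extraction stage:
    # match a (subject, relation, object) pattern over a tiny relation list.
    m = re.match(r"(.+?)\s+(treats|causes|is located in)\s+(.+)", clause)
    return (m.group(1), m.group(2), m.group(3)) if m else None

# Illustrative input: one complex sentence with a pronoun and a conjunction.
sentence = "Gefitinib treats lung cancer, and it causes skin rash."
resolved = resolve_coreference(sentence, "Gefitinib")
triples = [t for c in decompose(resolved) if (t := extract_triple(c))]
print(triples)
```

The point of the sketch is the ordering: resolving "it" before decomposition lets both simple clauses yield a triple with the correct subject, which is the recall gain on rare relations the ablation attributes to combining the two stages.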