KG20C & KG20C-QA: Scholarly Knowledge Graph Benchmarks for Link Prediction and Question Answering

2025-12-25
Citations: 0
Influential: 0
AI Summary
Academia still lacks a high-quality benchmark knowledge graph (KG) tailored for link prediction and natural language question answering. To address this gap, we introduce KG20C, a scholarly KG derived from the Microsoft Academic Graph and described here in peer-reviewed form for the first time, along with its companion question-answering benchmark, KG20C-QA. Methodologically, we design generalizable, structured QA templates to enable unified evaluation of both graph-based models and large language models; we also employ rigorous data curation, formal schema definition, and automated conversion of triples into QA pairs. Our contributions are threefold: (1) the first open-source, fully reproducible academic KG benchmark; (2) empirical analysis revealing significant performance disparities among state-of-the-art KG embedding methods (e.g., TransE, RotatE) on fine-grained relation types; and (3) a comprehensive evaluation protocol and extensible resources to advance knowledge-driven AI research in scholarly domains.

๐Ÿ“ Abstract
In this paper, we present KG20C and KG20C-QA, two curated datasets for advancing question answering (QA) research on scholarly data. KG20C is a high-quality scholarly knowledge graph constructed from the Microsoft Academic Graph through targeted selection of venues, quality-based filtering, and schema definition. Although KG20C has been available online in non-peer-reviewed sources such as a GitHub repository, this paper provides the first formal, peer-reviewed description of the dataset, including clear documentation of its construction and specifications. KG20C-QA is built upon KG20C to support QA tasks on scholarly data. We define a set of QA templates that convert graph triples into natural language question-answer pairs, producing a benchmark that can be used both with graph-based models such as knowledge graph embeddings and with text-based models such as large language models. We benchmark standard knowledge graph embedding methods on KG20C-QA, analyze performance across relation types, and provide reproducible evaluation protocols. By officially releasing these datasets with thorough documentation, we aim to contribute a reusable, extensible resource for the research community, enabling future work in QA, reasoning, and knowledge-driven applications in the scholarly domain. The full datasets will be released at https://github.com/tranhungnghiep/KG20C/ upon paper publication.
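To make the template idea in the abstract concrete, here is a minimal sketch of how graph triples might be turned into question-answer pairs. The relation names, template wording, and the `triples_to_qa` helper are all hypothetical and chosen only for illustration; they are not taken from the KG20C-QA release.

```python
from typing import Dict, List, Tuple

# A triple is (head entity, relation, tail entity).
Triple = Tuple[str, str, str]

# Hypothetical templates keyed by relation name. The actual KG20C-QA
# templates and relation names may differ; these are assumptions.
TEMPLATES: Dict[str, str] = {
    "author_write_paper": "Which paper was written by {head}?",
    "paper_in_venue": "In which venue was {head} published?",
}


def triples_to_qa(triples: List[Triple]) -> List[Dict[str, str]]:
    """Convert triples into QA records, skipping relations without a template."""
    qa_pairs = []
    for head, relation, tail in triples:
        template = TEMPLATES.get(relation)
        if template is None:
            continue  # no template defined for this relation
        qa_pairs.append(
            {
                "question": template.format(head=head),
                "answer": tail,
                "relation": relation,
            }
        )
    return qa_pairs


if __name__ == "__main__":
    sample = [
        ("Alice", "author_write_paper", "Paper X"),
        ("Paper X", "paper_in_venue", "Venue Y"),
    ]
    for pair in triples_to_qa(sample):
        print(pair["question"], "->", pair["answer"])
```

Because each record keeps the answer entity and the relation alongside the question text, the same pair could in principle be scored by a graph-based model (as a tail-prediction query) or posed to a text-based model as a plain question.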
Problem

Research questions and friction points this paper is trying to address.

Creating scholarly knowledge graph datasets for link prediction and QA tasks.
Providing formal documentation and reproducible benchmarks for research evaluation.
Enabling QA and reasoning applications in the scholarly domain.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed a scholarly knowledge graph from the Microsoft Academic Graph.
Created a QA benchmark that converts triples into question-answer pairs.
Provided reproducible evaluation protocols for embedding methods.
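As an illustration of what a per-relation evaluation of embedding methods could look like, the sketch below computes two common ranking metrics, mean reciprocal rank (MRR) and Hits@k, and groups MRR by relation type. The summary does not state which metrics the paper's protocol uses, so the metric choice, the tie-handling rule, and all function names here are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_of_answer(scores: Dict[str, float], answer: str) -> int:
    """1-based rank of the correct answer; a higher score is better.

    Ties are resolved optimistically: only strictly higher scores
    push the answer down (an assumption of this sketch).
    """
    target = scores[answer]
    return 1 + sum(1 for s in scores.values() if s > target)


def mrr(ranks: List[int]) -> float:
    """Mean reciprocal rank over a list of 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mrr_by_relation(results: List[Tuple[str, int]]) -> Dict[str, float]:
    """Group (relation, rank) results by relation and compute MRR for each."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for relation, rank in results:
        grouped[relation].append(rank)
    return {relation: mrr(ranks) for relation, ranks in grouped.items()}


if __name__ == "__main__":
    # Hypothetical candidate scores for one query.
    scores = {"Paper X": 0.9, "Paper Y": 0.4, "Paper Z": 0.7}
    print(rank_of_answer(scores, "Paper Z"))
    print(mrr_by_relation([("rel_a", 1), ("rel_a", 2), ("rel_b", 4)]))
```

Reporting the metric per relation rather than only in aggregate is what makes differences across relation types visible.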
Hung-Nghiep Tran
National Institute of Informatics, Japan
knowledge graphs, tensor methods, information retrieval, machine learning, AI
Atsuhiro Takasu
National Institute of Informatics, Tokyo, Japan; The Graduate University for Advanced Studies, SOKENDAI, Japan