AI Summary
A high-quality benchmark knowledge graph (KG) tailored for link prediction and natural language question answering remains absent in academia. To address this gap, we introduce KG20C, a scholarly KG derived from the Microsoft Academic Graph and described here in peer-reviewed form for the first time, along with its companion question-answering benchmark, KG20C-QA. Methodologically, we design generalizable, structured QA templates to enable unified evaluation of both graph-based models and large language models; further, we employ rigorous data curation, formal schema definition, and automated conversion of triples into QA pairs. Our contributions are threefold: (1) the first open-source, fully reproducible academic KG benchmark; (2) empirical analysis revealing significant performance disparities among state-of-the-art KG embedding methods (e.g., TransE, RotatE) on fine-grained relation types; and (3) a comprehensive evaluation protocol and extensible resources to advance knowledge-driven AI research in scholarly domains.
Abstract
In this paper, we present KG20C and KG20C-QA, two curated datasets for advancing question answering (QA) research on scholarly data. KG20C is a high-quality scholarly knowledge graph constructed from the Microsoft Academic Graph through targeted selection of venues, quality-based filtering, and schema definition. Although KG20C has been available online in non-peer-reviewed sources such as a GitHub repository, this paper provides the first formal, peer-reviewed description of the dataset, including clear documentation of its construction and specifications. KG20C-QA is built upon KG20C to support QA tasks on scholarly data. We define a set of QA templates that convert graph triples into natural language question-answer pairs, producing a benchmark that can be used both with graph-based models such as knowledge graph embeddings and with text-based models such as large language models. We benchmark standard knowledge graph embedding methods on KG20C-QA, analyze performance across relation types, and provide reproducible evaluation protocols. By officially releasing these datasets with thorough documentation, we aim to contribute a reusable, extensible resource for the research community, enabling future work in QA, reasoning, and knowledge-driven applications in the scholarly domain. The full datasets will be released at https://github.com/tranhungnghiep/KG20C/ upon paper publication.
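The template-based conversion of triples into question-answer pairs mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the relation names, the template wording, the entity labels, and the `triples_to_qa` helper are hypothetical and are not taken from the released dataset or its code.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Hypothetical templates: relation names and question wording are
# illustrative assumptions, not the templates defined in the paper.
TEMPLATES: Dict[str, str] = {
    "author_write_paper": "Which papers did {head} write?",
    "paper_in_venue": "In which venue was {head} published?",
}


def triples_to_qa(triples: List[Triple]) -> List[Dict[str, object]]:
    """Group tails by (head, relation) and render one question per group."""
    answers: Dict[Tuple[str, str], List[str]] = {}
    for head, relation, tail in triples:
        if relation not in TEMPLATES:
            continue  # skip relations that have no template
        answers.setdefault((head, relation), []).append(tail)
    return [
        {"question": TEMPLATES[rel].format(head=head), "answers": tails}
        for (head, rel), tails in answers.items()
    ]


# Toy triples with placeholder entity labels (not real dataset entries).
triples = [
    ("Author A", "author_write_paper", "Paper X"),
    ("Author A", "author_write_paper", "Paper Y"),
    ("Paper X", "paper_in_venue", "Venue V"),
    ("Paper X", "unknown_relation", "Z"),
]

qa_pairs = triples_to_qa(triples)
for qa in qa_pairs:
    print(qa["question"], "->", qa["answers"])
```

Grouping all tails under one question keeps every correct answer attached to it. Under that assumption, a graph-based model can be scored by ranking candidate tail entities for the (head, relation) query, and a text-based model can be scored by comparing its generated answer against the same answer set.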