🤖 AI Summary
As the volume of scientific literature grows, researchers need efficient tools for discovering and understanding it. This paper describes the data processing pipeline behind Semantic Scholar (S2), an open data platform and website. The pipeline combines public and proprietary data sources, scholarly PDF content extraction, and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, which the authors describe as the largest open scientific literature graph to date: over 200 million papers, 80 million authors, 550 million paper-authorship edges, and 2.4 billion citation edges. The graph also carries semantic features such as structurally parsed text, natural language summaries, and vector embeddings, and is exposed through the platform's APIs. The paper itself is a living document that the authors plan to update as data offerings and services change.
📝 Abstract
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services.
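To make the graph structure named in the abstract concrete, here is a minimal Python sketch of a graph with paper nodes, author nodes, paper-authorship edges, and citation edges. It is only an illustration: the class `AcademicGraph`, its methods, and the sample identifiers are hypothetical and do not reflect the actual S2 schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class AcademicGraph:
    """Toy in-memory graph; every name here is hypothetical, not the S2 schema."""
    papers: dict = field(default_factory=dict)     # paper_id -> title
    authors: dict = field(default_factory=dict)    # author_id -> name
    authorship: set = field(default_factory=set)   # (paper_id, author_id) edges
    citations: set = field(default_factory=set)    # (citing_id, cited_id) edges

    def add_author(self, author_id, name):
        self.authors[author_id] = name

    def add_paper(self, paper_id, title, author_ids=()):
        # Register the paper node and one authorship edge per listed author.
        self.papers[paper_id] = title
        for author_id in author_ids:
            self.authorship.add((paper_id, author_id))

    def add_citation(self, citing_id, cited_id):
        # Directed edge: citing_id cites cited_id.
        self.citations.add((citing_id, cited_id))

    def papers_by(self, author_id):
        # Follow authorship edges from an author to their papers.
        return sorted(p for p, a in self.authorship if a == author_id)

    def cited_by(self, paper_id):
        # Follow citation edges backwards to the papers that cite paper_id.
        return sorted(c for c, d in self.citations if d == paper_id)


# Made-up sample data, for illustration only.
graph = AcademicGraph()
graph.add_author("a1", "Ada Example")
graph.add_author("a2", "Bo Example")
graph.add_paper("p1", "First paper", ["a1"])
graph.add_paper("p2", "Second paper", ["a1", "a2"])
graph.add_citation("p2", "p1")
```

Storing the two edge types as separate sets of pairs mirrors the way the abstract counts paper-authorship edges and citation edges separately; a graph at the scale reported would need a very different storage layer.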