🤖 AI Summary
Short technical documents (e.g., IBM Technotes) are knowledge-sparse and organize their content implicitly through page structure, which hinders knowledge extraction and downstream applications. Method: This paper proposes a micro knowledge graph (micrograph) construction framework that combines DOM parsing with lightweight NLP to extract all entities and actions in a page, together with their precise positions within that page. It introduces a domain-specific, semi-structured schema tailored to technical documentation, enabling step-aware procedure identification and structure-aware modeling. Contribution/Results: We formally define and implement the first fine-grained, page-structure-aligned knowledge graph representation, preserving both semantic and spatial relationships. Evaluated on real-world Technotes, our micrographs achieve high construction accuracy; downstream question answering and retrieval tasks show significant improvements in both accuracy and interpretability, while knowledge coverage increases more than threefold compared to conventional knowledge graphs.
📝 Abstract
Short technical support pages such as IBM Technotes are common in the technical support domain. These pages can be very useful knowledge sources for technical support applications such as chatbots, search engines and question-answering (QA) systems. Information extracted from documents to drive such applications is often stored in the form of a Knowledge Graph (KG). Building a KG from a large corpus of documents poses a granularity challenge, because each page contains a large number of entities and actions. The KG becomes virtually unusable if all entities and actions from these pages are stored in it. Therefore, only the key entities and actions from each page are extracted and stored in the KG. This approach, however, loses the knowledge represented by the entities and actions that are left out, as they are no longer available to graph search and reasoning functions. We propose a set of techniques to create a micro knowledge graph (micrograph) for each such web page. The micrograph stores all the entities and actions in a page and uses the structure of the page to record exactly where in the page each entity and action appeared and how they relate to each other. These micrographs can be used as additional knowledge sources by technical support applications. We define schemas for representing the semi-structured and plain-text knowledge present in technical support web pages. Solutions in the technical support domain include procedures made up of steps. We also propose a technique to extract procedures from these web pages, along with schemas to represent them in the micrographs. Finally, we discuss how technical support applications can take advantage of the micrographs.
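To make the idea of a per-page micrograph more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified schema invented here for illustration: node types `section`, `text`, `procedure` and `step`, and edge labels `contains`, `has_procedure`, `has_step` and `next_step`. It treats each heading as a section and each `<ol>` as a procedure whose `<li>` items are ordered steps, and records a DOM path for every node as its position in the page. Entity and action extraction with NLP is left out, nested lists and nested text elements are not handled, and the sample page is hypothetical.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"h1", "h2", "h3", "p", "li"}           # elements whose text becomes a node
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never closed, so never pushed


class MicrographBuilder(HTMLParser):
    """Builds a small graph of sections, text blocks and procedure steps."""

    def __init__(self):
        super().__init__()
        self.nodes = []         # dicts: id, type, text, path
        self.edges = []         # tuples: (source_id, relation, target_id)
        self._stack = []        # currently open tags, used as the DOM path
        self._buf = None        # text collected inside the current TEXT_TAGS element
        self._section = None    # id of the most recent heading node
        self._procedure = None  # id of the currently open <ol>, if any
        self._prev_step = None  # id of the previous step in that <ol>

    def _add_node(self, kind, text=""):
        node_id = len(self.nodes)
        self.nodes.append({"id": node_id, "type": kind, "text": text,
                           "path": "/".join(self._stack)})
        return node_id

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._stack.append(tag)
        if tag == "ol":
            # Assumption of this sketch: every ordered list is a procedure.
            self._procedure = self._add_node("procedure")
            self._prev_step = None
            if self._section is not None:
                self.edges.append((self._section, "has_procedure", self._procedure))
        elif tag in TEXT_TAGS:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._buf is not None:
            text = " ".join("".join(self._buf).split())  # normalize whitespace
            self._buf = None
            if text:
                self._link(tag, text)
        if tag == "ol":
            self._procedure = None
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack.pop() != tag:
                pass

    def _link(self, tag, text):
        if tag in ("h1", "h2", "h3"):
            self._section = self._add_node("section", text)
        elif tag == "li" and self._procedure is not None:
            step = self._add_node("step", text)
            self.edges.append((self._procedure, "has_step", step))
            if self._prev_step is not None:
                self.edges.append((self._prev_step, "next_step", step))
            self._prev_step = step
        else:
            node = self._add_node("text", text)
            if self._section is not None:
                self.edges.append((self._section, "contains", node))


def build_micrograph(html):
    """Return (nodes, edges) for one HTML page."""
    builder = MicrographBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes, builder.edges


# Hypothetical support page used only to exercise the sketch.
PAGE = """
<html><body>
<h1>Server fails to start</h1>
<h2>Symptom</h2>
<p>The server does not start after an upgrade.</p>
<h2>Resolution</h2>
<ol>
<li>Stop the server.</li>
<li>Delete the temporary files.</li>
<li>Restart the server.</li>
</ol>
</body></html>
"""

if __name__ == "__main__":
    nodes, edges = build_micrograph(PAGE)
    for node in nodes:
        print(node)
    for edge in edges:
        print(edge)
```

Because every node keeps its DOM path and every step keeps a `next_step` link, an application could answer "what comes after stopping the server?" by following edges within this one page, without that detail having to exist in the corpus-wide KG.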