🤖 AI Summary
Existing PDF parsing tools are designed primarily for academic papers and struggle to accurately process pedagogical documents, such as legal textbooks, whose hierarchies are complex and often only implicitly marked. To address this, the authors propose a hierarchical text segmentation framework that combines structure-aware preprocessing with large language models (LLMs). The method draws on OCR-based title detection, XML-derived structural features, and contextual text features to infer heading hierarchies without requiring explicit table-of-contents input. Compared with LLM-only approaches or open-source structural parsing tools, the combined framework substantially reduces false positives and improves extraction quality. When the heading metadata in the PDF is of high quality, a table-of-contents-driven strategy performs particularly well. The code and data are publicly released to support replication.
📝 Abstract
The growing demand for effective tools to parse PDF-formatted texts, particularly structured documents such as textbooks, reveals the limitations of current methods developed mainly for research paper segmentation. This work addresses the challenge of hierarchical segmentation in complex structured documents, with a focus on legal textbooks that contain layered knowledge essential for interpreting and applying legal norms. We examine a Table of Contents (TOC)-based technique and approaches that rely on open-source structural parsing tools or Large Language Models (LLMs) operating without explicit TOC input. To enhance parsing accuracy, we incorporate preprocessing strategies such as OCR-based title detection, XML-derived features, and contextual text features. These strategies are evaluated based on their ability to identify section titles, allocate hierarchy levels, and determine section boundaries. Our findings show that combining LLMs with structure-aware preprocessing substantially reduces false positives and improves extraction quality. We also find that when the metadata quality of headings in the PDF is high, TOC-based techniques perform particularly well. All code and data are publicly available to support replication. We conclude with a comparative evaluation of the methods, outlining their respective strengths and limitations.
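To make the three evaluated subtasks concrete (identifying section titles, allocating hierarchy levels, and determining section boundaries), the following is a minimal sketch of a TOC-driven segmentation pass. It is an illustration under stated assumptions, not the authors' implementation: the `Section` type, the `segment_by_toc` function, the exact-match title lookup, and the sample data are all hypothetical, and the document is assumed to be already extracted into a list of text lines with TOC entries given as `(title, level)` pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Section:
    title: str
    level: int  # 1 = chapter, 2 = section, and so on (assumed convention)
    start: int  # index of the heading line
    end: int    # index one past the last line of the section (exclusive)


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so minor extraction noise still matches."""
    return " ".join(text.split()).casefold()


def segment_by_toc(lines: List[str], toc: List[Tuple[str, int]]) -> List[Section]:
    """Locate each TOC title in reading order, then derive hierarchical spans.

    Hypothetical helper: titles that cannot be found are skipped.
    """
    sections: List[Section] = []
    cursor = 0
    for title, level in toc:
        wanted = _normalize(title)
        for i in range(cursor, len(lines)):
            if _normalize(lines[i]) == wanted:
                # Provisional end: the end of the document.
                sections.append(Section(title, level, i, len(lines)))
                cursor = i + 1
                break

    # A section ends where the next heading at the same or a shallower level begins.
    for idx, sec in enumerate(sections):
        for later in sections[idx + 1:]:
            if later.level <= sec.level:
                sec.end = later.start
                break
    return sections


# Hypothetical sample input, loosely modeled on a legal textbook.
sample_lines = [
    "Chapter 1 Contract Law",
    "Introductory text.",
    "1.1 Offer",
    "Text about offers.",
    "1.2 Acceptance",
    "Text about acceptance.",
    "Chapter 2 Tort Law",
    "Text about torts.",
]
sample_toc = [
    ("Chapter 1 Contract Law", 1),
    ("1.1 Offer", 2),
    ("1.2 Acceptance", 2),
    ("Chapter 2 Tort Law", 1),
]

for section in segment_by_toc(sample_lines, sample_toc):
    print(section)
```

In this sketch a parent section's span encloses its children, so the chapter spans nest around their subsections. Approaches that work without TOC input would replace the title lookup with detected headings (for example from OCR or XML features) and would need to infer the `level` values rather than read them from metadata.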