HiPS: Hierarchical PDF Segmentation of Textbooks

📅 2025-08-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing PDF parsing tools are primarily designed for academic papers and struggle to accurately process pedagogical documents, such as legal textbooks, whose hierarchies are complex and often only implicitly marked. To address this, the authors evaluate a hierarchical text segmentation framework that combines structure-aware preprocessing with large language models (LLMs). The method draws on OCR-based heading detection, XML-derived structural features, and contextual text features to infer heading hierarchies without requiring explicit table-of-contents input. Compared with LLM-only or open-source structural parsing approaches, the combined framework substantially reduces false positives and improves extraction quality. When the heading metadata in the PDF is of high quality, a table-of-contents-driven strategy performs particularly well. The code and data are publicly released to support replication.

📝 Abstract

The growing demand for effective tools to parse PDF-formatted texts, particularly structured documents such as textbooks, reveals the limitations of current methods developed mainly for research paper segmentation. This work addresses the challenge of hierarchical segmentation in complex structured documents, with a focus on legal textbooks that contain layered knowledge essential for interpreting and applying legal norms. We examine a Table of Contents (TOC)-based technique and approaches that rely on open-source structural parsing tools or Large Language Models (LLMs) operating without explicit TOC input. To enhance parsing accuracy, we incorporate preprocessing strategies such as OCR-based title detection, XML-derived features, and contextual text features. These strategies are evaluated based on their ability to identify section titles, allocate hierarchy levels, and determine section boundaries. Our findings show that combining LLMs with structure-aware preprocessing substantially reduces false positives and improves extraction quality. We also find that when the metadata quality of headings in the PDF is high, TOC-based techniques perform particularly well. All code and data are publicly available to support replication. We conclude with a comparative evaluation of the methods, outlining their respective strengths and limitations.
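The evaluation criteria named in the abstract (identifying section titles, allocating hierarchy levels) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the `Heading` type, the `score_headings` function, and the exact-match rule on normalized titles are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Heading:
    """Hypothetical record for one detected or gold-standard heading."""
    title: str
    level: int  # 1 = chapter, 2 = section, and so on


def normalize(title: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not count as mismatches (an assumption).
    return " ".join(title.lower().split())


def score_headings(predicted, gold):
    """Compare predicted headings with gold headings by normalized title."""
    pred_map = {normalize(h.title): h.level for h in predicted}
    gold_map = {normalize(h.title): h.level for h in gold}
    matched = set(pred_map) & set(gold_map)

    tp = len(matched)            # titles found in both sets
    fp = len(pred_map) - tp      # predicted titles that are not real headings
    fn = len(gold_map) - tp      # real headings that were missed

    precision = tp / (tp + fp) if pred_map else 0.0
    recall = tp / (tp + fn) if gold_map else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Among correctly detected titles, how many got the right level?
    level_accuracy = (sum(pred_map[t] == gold_map[t] for t in matched) / tp
                      if tp else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "level_accuracy": level_accuracy, "false_positives": fp}


gold = [Heading("1 Introduction", 1), Heading("1.1 Scope", 2),
        Heading("2 Sources of Law", 1)]
predicted = [Heading("1 Introduction", 1), Heading("1.1 Scope", 1),
             Heading("See also page 12", 2)]
scores = score_headings(predicted, gold)
```

In this toy input, the running text "See also page 12" is the kind of false positive that structure-aware preprocessing is reported to reduce, and "1.1 Scope" is detected but assigned the wrong level.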

Problem

Research questions and friction points this paper is trying to address.

Hierarchical segmentation of complex structured PDF documents

Improving parsing accuracy with preprocessing and LLM integration

Evaluating TOC-based and structure-aware methods for textbook analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining LLMs with structure-aware preprocessing

OCR-based title detection and XML-derived features

TOC-based techniques for PDFs with high-quality heading metadata
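Once headings and their levels are known, whether from a TOC or from detection, section boundaries follow from the nesting. The sketch below shows one simple stack-based way to do this; it is an illustrative assumption, not the authors' implementation. The `Section` type, the `build_hierarchy` function, and the input format of `(title, level, start_line)` tuples are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical node in the section tree."""
    title: str
    level: int
    start: int                 # first line of the section (inclusive)
    end: int = -1              # line where the section stops (exclusive)
    children: list = field(default_factory=list)


def build_hierarchy(headings, total_lines):
    """Nest a flat heading list into a tree and assign boundaries.

    Assumes `headings` is sorted by start line and that levels are >= 1.
    """
    root = Section("ROOT", 0, 0, total_lines)
    stack = [root]
    for title, level, start in headings:
        # A new heading closes every open section at the same or deeper level.
        while stack[-1].level >= level:
            stack.pop().end = start
        node = Section(title, level, start)
        stack[-1].children.append(node)
        stack.append(node)
    # Sections still open at the end run to the end of the document.
    while len(stack) > 1:
        stack.pop().end = total_lines
    return root


headings = [("1 Introduction", 1, 0), ("1.1 Scope", 2, 5),
            ("1.2 Method", 2, 12), ("2 Sources", 1, 20)]
tree = build_hierarchy(headings, total_lines=30)
intro, sources = tree.children
```

A parent section's range covers all of its children, so "1 Introduction" here spans lines 0 to 20 while its subsections split that range between them.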

🔎 Similar Papers

READoc: A Unified Benchmark for Realistic Document Structured Extraction