PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction

📅 2025-03-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing document layout detection models face bottlenecks in generalization, robustness to complex layouts, and real-time inference. To address these challenges, we propose PP-DocLayout, a family of three models (high-precision L, balanced M, and high-efficiency S) that detects 23 layout element classes (e.g., titles, paragraphs, tables, formulas) across diverse document formats. PP-DocLayout-L is based on the RT-DETR-L detector and achieves 90.4% mAP@0.5 at 13.4 ms/page on an NVIDIA T4 GPU. PP-DocLayout-M reaches 75.2% mAP@0.5 at 12.7 ms/page on a T4 GPU. The smaller PP-DocLayout-S variant processes a page in 8.1 ms on a T4 GPU or 14.5 ms on a CPU, improving throughput for large-scale document auto-annotation. Code and pre-trained models are publicly released.

📝 Abstract

Document layout analysis is a critical preprocessing step in document intelligence, enabling the detection and localization of structural elements such as titles, text blocks, tables, and formulas. Despite its importance, existing layout detection models face significant challenges in generalizing across diverse document types, handling complex layouts, and achieving real-time performance for large-scale data processing. To address these limitations, we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. To meet different needs, we offer three models of varying scales. PP-DocLayout-L is a high-precision model based on the RT-DETR-L detector, achieving 90.4% mAP@0.5 and an end-to-end inference time of 13.4 ms per page on a T4 GPU. PP-DocLayout-M is a balanced model, offering 75.2% mAP@0.5 with an inference time of 12.7 ms per page on a T4 GPU. PP-DocLayout-S is a high-efficiency model designed for resource-constrained environments and real-time applications, with an inference time of 8.1 ms per page on a T4 GPU and 14.5 ms on a CPU. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data, enabling advancements in document intelligence and multimodal AI systems. Code and models are available at https://github.com/PaddlePaddle/PaddleX.
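The accuracy figures above are reported as mAP@0.5, a metric in which a predicted box is normally counted as correct only when its intersection-over-union (IoU) with a same-class ground-truth box is at least 0.5. The sketch below illustrates that matching rule for a single pair of boxes. The helper names `iou` and `is_true_positive` and the example coordinates are hypothetical, and the paper's exact evaluation protocol is not described on this page, so treat this as an illustration of the standard convention rather than the authors' evaluation code.

```python
from typing import Tuple

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates (assumed convention).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    """Apply the IoU >= threshold rule used by mAP@0.5-style metrics."""
    return iou(pred, gt) >= threshold


if __name__ == "__main__":
    # Hypothetical ground-truth table region and two candidate predictions.
    gt_table = (100.0, 100.0, 300.0, 200.0)
    close_pred = (120.0, 110.0, 310.0, 205.0)
    shifted_pred = (200.0, 100.0, 400.0, 200.0)

    print(f"close:   IoU={iou(close_pred, gt_table):.3f}")
    print(f"shifted: IoU={iou(shifted_pred, gt_table):.3f}")
```

Under this rule the closely aligned prediction counts as a hit, while the prediction shifted by half the box width does not. A full mAP computation would additionally rank predictions by confidence and average precision over all 23 classes.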

Problem

Research questions and friction points this paper is trying to address.

Existing layout detection models generalize poorly across diverse document types and struggle with complex layouts

Existing models are too slow for real-time, large-scale data processing

No single model size suits both high-precision and resource-constrained deployments

Innovation

Methods, ideas, or system contributions that make the work stand out.

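Unified model detects 23 layout region types across diverse document formats

Three model scales (L, M, S) that trade accuracy for speed

Inference in 8.1 to 13.4 ms per page on a T4 GPU, and 14.5 ms on a CPU for PP-DocLayout-S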
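To put the per-page latencies in terms of large-scale data construction, the sketch below converts them into pages per second and hours per corpus. The latency values come from the abstract. The function names, the one-million-page corpus size, and the assumption of strictly sequential processing on one device (no batching, no I/O or preprocessing overhead) are illustrative assumptions, so real throughput may differ.

```python
# Per-page inference latencies in milliseconds, as reported in the abstract.
LATENCY_MS = {
    ("PP-DocLayout-L", "T4 GPU"): 13.4,
    ("PP-DocLayout-M", "T4 GPU"): 12.7,
    ("PP-DocLayout-S", "T4 GPU"): 8.1,
    ("PP-DocLayout-S", "CPU"): 14.5,
}


def pages_per_second(latency_ms: float) -> float:
    """Sequential throughput implied by a per-page latency."""
    return 1000.0 / latency_ms


def hours_for_corpus(num_pages: int, latency_ms: float) -> float:
    """Wall-clock hours to process num_pages one at a time on one device."""
    total_seconds = num_pages * latency_ms / 1000.0
    return total_seconds / 3600.0


if __name__ == "__main__":
    corpus_pages = 1_000_000  # hypothetical corpus size
    for (model, device), latency in LATENCY_MS.items():
        print(
            f"{model} on {device}: "
            f"{pages_per_second(latency):.1f} pages/s, "
            f"{hours_for_corpus(corpus_pages, latency):.2f} h per million pages"
        )
```

Under these assumptions, a million pages take roughly 3.7 hours with PP-DocLayout-L and 2.25 hours with PP-DocLayout-S on a single T4 GPU.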
