IndicDLP: A Foundational Dataset for Multi-Lingual and Multi-Domain Document Layout Parsing

📅 2025-12-23
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing document layout analysis datasets suffer from three key limitations: a lack of fine-grained annotations, insufficient multilingual coverage (particularly for Indian languages), and difficulty balancing scale with domain diversity. To address these, we introduce IndicDLP, the first large-scale, fine-grained document layout parsing benchmark that is both multilingual (11 Indian languages plus English) and multi-domain (12 document categories), comprising over 100,000 real-world pages with pixel-level region annotations. We further construct UED-mini, a pretraining dataset built by aligning the label taxonomies of DocLayNet and M6Doc, with human verification to ensure high-quality semi-automatic labels. This approach significantly improves model transferability: fine-tuning on IndicDLP yields a +28.4% mAP gain on Indic documents, and the fine-tuned models also generalize well when evaluated on non-Indic documents. IndicDLP is currently the largest open-source multilingual document layout dataset, filling a critical gap in the field.
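The cross-dataset alignment behind UED-mini can be pictured as a label-taxonomy mapping step. The sketch below is purely illustrative: the unified category names and the M6Doc labels are assumptions, not the paper's actual mapping (the DocLayNet labels shown are real DocLayNet categories). In the described pipeline, unmapped or ambiguous regions would be the ones routed to human verification.

```python
# Hypothetical label alignment between two source taxonomies and a unified
# one. The unified names and the M6Doc entries below are assumptions for
# illustration; only the DocLayNet labels are taken from that dataset.

DOCLAYNET_TO_UNIFIED = {
    "Text": "paragraph",
    "Section-header": "heading",
    "Picture": "figure",
    "Table": "table",
}

M6DOC_TO_UNIFIED = {
    "paragraph": "paragraph",
    "title": "heading",
    "figure": "figure",
    "table": "table",
}

def align(regions, mapping):
    """Relabel (box, label) regions into the unified taxonomy.

    Regions whose label has no mapping are dropped here; in a
    semi-automatic pipeline they would be flagged for human review.
    """
    return [(box, mapping[label]) for box, label in regions if label in mapping]
```

A quick usage example: `align([((0, 0, 100, 40), "Section-header")], DOCLAYNET_TO_UNIFIED)` relabels the region as a unified `"heading"`.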

📝 Abstract
Document layout analysis is essential for downstream tasks such as information retrieval, extraction, OCR, and digitization. However, existing large-scale datasets like PubLayNet and DocBank lack fine-grained region labels and multilingual diversity, making them insufficient for representing complex document layouts. In contrast, human-annotated datasets such as M6Doc and D4LA offer richer labels and greater domain diversity, but are too small to train robust models and lack adequate multilingual coverage. This gap is especially pronounced for Indic documents, which encompass diverse scripts yet remain underrepresented in current datasets, further limiting progress in this space. To address these shortcomings, we introduce IndicDLP, a large-scale foundational document layout dataset spanning 11 representative Indic languages alongside English and 12 common document domains. Additionally, we curate UED-mini, a dataset derived from DocLayNet and M6Doc, to enhance pretraining and provide a solid foundation for Indic layout models. Our experiments demonstrate that fine-tuning existing English models on IndicDLP significantly boosts performance, validating its effectiveness. Moreover, models trained on IndicDLP generalize well beyond Indic layouts, making it a valuable resource for document digitization. This work bridges gaps in scale, diversity, and annotation granularity, driving inclusive and efficient document understanding.
Problem

Research questions and friction points this paper is trying to address.

Lack of fine-grained labels and multilingual diversity in existing document layout datasets.
Insufficient representation of complex Indic document layouts across diverse scripts.
Need for large-scale, diverse datasets to train robust multilingual layout parsing models.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the IndicDLP dataset for multilingual document layout parsing.
Curates the UED-mini dataset to enhance pretraining for Indic layout models.
Fine-tunes existing English models on IndicDLP to boost performance.
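The mAP gains reported for fine-tuned models rest on IoU-based matching of predicted layout regions against ground truth. As a minimal sketch of that building block (the box format, greedy matching, and the 0.5 threshold are common conventions assumed here, not details from the paper):

```python
# Minimal sketch of IoU-based region matching, the building block behind
# detection metrics like mAP. Boxes are (x1, y1, x2, y2); the greedy
# matching strategy and 0.5 threshold are assumed conventions.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def count_true_positives(preds, gts, thresh=0.5):
    """Greedily match predictions (box, confidence) to ground-truth boxes.

    Each ground-truth box is matched at most once; predictions are
    visited in descending confidence order. Returns the true-positive
    count, from which precision/recall (and ultimately AP) follow.
    """
    used = set()
    tp = 0
    for box, _conf in sorted(preds, key=lambda p: -p[1]):
        best, best_iou = None, thresh
        for i, g in enumerate(gts):
            if i in used:
                continue
            overlap = iou(box, g)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is not None:
            used.add(best)
            tp += 1
    return tp
```

Averaging precision over recall levels and classes from matches like these yields the per-class AP and overall mAP used to compare layout models.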