TextBite: A Historical Czech Document Dataset for Logical Page Segmentation

📅 2025-03-20

📈 Citations: 0

✨ Influential: 0

career value

175K/year

🤖 AI Summary

This work addresses logical page segmentation in historical Czech documents, proposing a purely image-based approach to eliminate OCR dependency and mitigate geometric distortion effects on evaluation. We introduce the first large-scale historical Czech logical layout dataset—spanning newspapers, dictionaries, and manuscripts from the 18th–20th centuries—comprising 8,449 pages and 78,863 paragraph-level annotations. We formally define the image-level logical segmentation task and propose a foreground-text-pixel-only evaluation paradigm. Methodologically, we decouple logical segmentation from OCR and geometric priors: integrating U-Net or Mask R-CNN for text detection, graph neural networks to model inter-paragraph semantic relationships, and handwriting-robust preprocessing. Our method achieves a paragraph-level F1 score of 0.82 on TextBite, outperforming OCR-dependent baselines by 17%. We publicly release the dataset, code, and evaluation framework, establishing the first dedicated benchmark for Central European historical document analysis.

Technology Category

Application Category

📝 Abstract

Logical page segmentation is an important step in document analysis, enabling better semantic representations, information retrieval, and text understanding. Previous approaches define logical segmentation either through text or geometric objects, relying on OCR or precise geometry. To avoid the need for OCR, we define the task purely as segmentation in the image domain. Furthermore, to ensure the evaluation remains unaffected by geometrical variations that do not impact text segmentation, we propose to use only foreground text pixels in the evaluation metric and disregard all background pixels. To support research in logical document segmentation, we introduce TextBite, a dataset of historical Czech documents spanning the 18th to 20th centuries, featuring diverse layouts from newspapers, dictionaries, and handwritten records. The dataset comprises 8,449 page images with 78,863 annotated segments of logically and thematically coherent text. We propose a set of baseline methods combining text region detection and relation prediction. The dataset, baselines and evaluation framework can be accessed at https://github.com/DCGM/textbite-dataset.

Problem

Research questions and friction points this paper is trying to address.

Logical page segmentation without relying on OCR

Evaluation metric using only foreground text pixels

Dataset for historical Czech document segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Image domain segmentation without OCR

Foreground text pixel evaluation metric

Baseline methods combining detection and prediction

🔎 Similar Papers

Chronicling Germany: An Annotated Historical Newspaper Dataset