DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning

📅 2025-04-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Document image segmentation faces challenges of weak generalization and resource redundancy due to task diversity (e.g., layout analysis, multi-granularity text segmentation, table structure recognition) and format heterogeneity. To address this, we propose the first unified framework for document image segmentation, jointly modeling all tasks as collaborative instance and semantic segmentation. Our method introduces Sentence-BERT–driven semantic queries and a novel dual-path interaction mechanism between semantic and instance queries, enabling cross-task and cross-dataset joint training. Built upon a Transformer architecture, the framework integrates image-query cross-attention with a dot-product classifier for unified prediction. Extensive experiments demonstrate significant improvements over state-of-the-art methods across multiple benchmarks—achieving higher accuracy, faster inference, and stronger cross-domain generalization. The source code is publicly available.

📝 Abstract
Document image segmentation is crucial for document analysis and recognition but remains challenging due to the diversity of document formats and segmentation tasks. Existing methods often address these tasks separately, resulting in limited generalization and resource wastage. This paper introduces DocSAM, a transformer-based unified framework designed for various document image segmentation tasks, such as document layout analysis, multi-granularity text segmentation, and table structure recognition, by modelling these tasks as a combination of instance and semantic segmentation. Specifically, DocSAM employs Sentence-BERT to map category names from each dataset into semantic queries that match the dimensionality of instance queries. These two sets of queries interact through an attention mechanism and are cross-attended with image features to predict instance and semantic segmentation masks. Instance categories are predicted by computing the dot product between instance and semantic queries, followed by softmax normalization of scores. Consequently, DocSAM can be jointly trained on heterogeneous datasets, enhancing robustness and generalization while reducing computational and storage resources. Comprehensive evaluations show that DocSAM surpasses existing methods in accuracy, efficiency, and adaptability, highlighting its potential for advancing document image understanding and segmentation across various applications. Codes are available at https://github.com/xhli-git/DocSAM.
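The abstract's classification step — dot product between instance and semantic queries followed by softmax — can be sketched as follows. This is a minimal illustration, not the paper's implementation: NumPy stands in for the actual Transformer framework, and all dimensions are made-up; in DocSAM the semantic queries would come from Sentence-BERT encodings of category names.

```python
import numpy as np

def classify_instances(instance_queries, semantic_queries):
    """Assign categories to instance queries by dot-product similarity
    with the semantic queries, then softmax-normalize the scores.

    instance_queries: (N, d) array, one row per predicted instance.
    semantic_queries: (C, d) array, one row per category name.
    Returns: (N, C) array of per-instance category probabilities.
    """
    logits = instance_queries @ semantic_queries.T    # (N, C) similarity scores
    logits -= logits.max(axis=1, keepdims=True)       # subtract row max for stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)         # softmax over categories
    return probs

# Toy example with hypothetical sizes: 3 instances, 4 categories, d = 8.
rng = np.random.default_rng(0)
inst = rng.normal(size=(3, 8))
sem = rng.normal(size=(4, 8))
p = classify_instances(inst, sem)
print(p.shape)  # (3, 4)
```

Because categories enter only through the semantic-query matrix, swapping in a different dataset's category names changes `semantic_queries` without retraining a fixed classification head — which is what enables the cross-dataset joint training the summary describes.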
Problem

Research questions and friction points this paper is trying to address.

Lack of a unified framework covering diverse document image segmentation tasks
Limited generalization and resource wastage when existing methods handle each task separately
Difficulty of training robustly across heterogeneous datasets with differing formats and label sets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based unified framework for document segmentation
Sentence-BERT maps category names to semantic queries
Joint training on heterogeneous datasets enhances robustness
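The abstract also states that the combined queries are cross-attended with image features to predict masks. A minimal single-head scaled dot-product cross-attention, sketched in NumPy with invented dimensions (the paper's multi-head Transformer attention is more elaborate), looks like:

```python
import numpy as np

def cross_attention(queries, image_features):
    """Single-head scaled dot-product cross-attention: the combined
    (instance + semantic) queries attend to flattened image features.

    queries:        (Q, d) array of query vectors.
    image_features: (P, d) array, one row per image patch token.
    Returns: (Q, d) array of image-conditioned query representations.
    """
    d = queries.shape[1]
    scores = queries @ image_features.T / np.sqrt(d)  # (Q, P) attention logits
    scores -= scores.max(axis=1, keepdims=True)       # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over image tokens
    return weights @ image_features                   # weighted sum of features

# Hypothetical sizes: 5 queries, 16 image tokens, d = 8.
rng = np.random.default_rng(1)
q = rng.normal(size=(5, 8))
feats = rng.normal(size=(16, 8))
out = cross_attention(q, feats)
print(out.shape)  # (5, 8)
```

In mask-prediction frameworks of this family, the updated query vectors are then typically dotted with per-pixel features to produce segmentation masks, one mask per query.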
Xiao-Hui Li
Huawei; The Hong Kong University of Science and Technology
Multimodal Large Language Models; Explainable Artificial Intelligence; Physics
Fei Yin
MAIS, Institute of Automation of Chinese Academy of Sciences, Beijing, 100190, China
Cheng-Lin Liu
MAIS, Institute of Automation of Chinese Academy of Sciences, Beijing, 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China