🤖 AI Summary
Document understanding models that rely on absolute 2D positional embeddings suffer from poor generalization, high computational overhead, and dependence on massive pretraining corpora. To address these limitations, this paper proposes DocPolarBERT, a novel architecture that replaces conventional Cartesian-coordinate absolute position embeddings with relative polar-coordinate positional encodings, redesigning the self-attention mechanism to explicitly model directional and distance relationships among text blocks. By eliminating reliance on a global coordinate system, DocPolarBERT significantly enhances layout awareness. Built upon the BERT backbone, the model achieves state-of-the-art performance on multiple standard document understanding benchmarks, including FUNSD, CORD, and SROIE, despite being pretrained on a dataset more than six times smaller than the widely used IIT-CDIP corpus. This demonstrates the feasibility of efficient, layout-aware representation learning with substantially reduced data requirements.
📝 Abstract
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take text block positions into account in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
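The abstract does not specify how the relative polar coordinates enter self-attention. As a minimal illustrative sketch, under assumed conventions not taken from the paper (block positions as (x, y) center points, distances and angles quantized into bins, and a learned bias table added to the raw attention logits in the style of relative-position biases), the pairwise polar features could be computed like this:

```python
import numpy as np

def relative_polar_features(centers):
    """Pairwise relative polar coordinates between text-block centers.

    centers: (N, 2) array of (x, y) block-center points (an assumed
    convention; the paper may use a different reference point).
    Returns (N, N) arrays r (Euclidean distance) and theta (angle of
    block j as seen from block i, in radians, range [-pi, pi]).
    """
    centers = np.asarray(centers, dtype=float)
    dx = centers[None, :, 0] - centers[:, None, 0]  # x_j - x_i
    dy = centers[None, :, 1] - centers[:, None, 1]  # y_j - y_i
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)
    return r, theta

def polar_attention_bias(r, theta, dist_bins, num_angle_bins, bias_table):
    """Map each (distance, angle) pair to a scalar bias to be added to
    the attention logits. bias_table is a hypothetical learned table of
    shape (len(dist_bins) + 1, num_angle_bins); the binning scheme here
    is illustrative, not the paper's actual parameterization.
    """
    d_idx = np.digitize(r, dist_bins)  # distance bucket, 0..len(dist_bins)
    # Uniform angle buckets over the full circle.
    a_idx = ((theta + np.pi) / (2 * np.pi) * num_angle_bins).astype(int)
    a_idx = a_idx % num_angle_bins
    return bias_table[d_idx, a_idx]

# Usage: three blocks; block 1 is at distance 5 from block 0,
# block 2 is directly "above" block 0 (dy = 1, dx = 0).
r, theta = relative_polar_features([[0, 0], [3, 4], [0, 1]])
dist_bins = np.array([1.0, 2.0, 5.0])
bias_table = np.zeros((len(dist_bins) + 1, 8))  # placeholder for learned weights
bias = polar_attention_bias(r, theta, dist_bins, 8, bias_table)
```

Because the features are differences between positions, shifting every block by the same offset leaves r and theta unchanged, which is the translation invariance that a global Cartesian embedding lacks.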