🤖 AI Summary
Document understanding models that rely on absolute 2D positional embeddings suffer from poor generalization, high computational overhead, and dependence on massive pretraining corpora. To address these limitations, this paper proposes DocPolarBERT, a novel architecture that replaces conventional Cartesian-coordinate absolute position embeddings with relative polar-coordinate positional encodings, redesigning the self-attention mechanism to explicitly model directional and distance relationships among text blocks. By eliminating reliance on a global coordinate system, DocPolarBERT significantly enhances layout awareness. Built upon the BERT backbone, the model achieves state-of-the-art performance on multiple standard document understanding benchmarks, including FUNSD, CORD, and SROIE, despite being pretrained on a dataset more than six times smaller than the widely used IIT-CDIP corpus. This demonstrates the feasibility of efficient, layout-aware representation learning with substantially reduced data requirements.
📝 Abstract
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take text block positions into account in a relative polar coordinate system rather than a Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
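The abstract does not specify how the relative polar coordinates enter self-attention. As a minimal illustrative sketch, under assumed conventions not taken from the paper (block positions as (x, y) center points, distances and angles quantized into bins, and a learned bias table added to the raw attention logits in the style of relative-position biases), the pairwise polar features could be computed like this:

```python
import numpy as np

def relative_polar_features(centers):
    """Pairwise relative polar coordinates between text-block centers.

    centers: (N, 2) array of (x, y) block-center points (an assumed
    convention; the paper may use a different reference point).
    Returns (N, N) arrays r (Euclidean distance) and theta (angle of
    block j as seen from block i, in radians, range [-pi, pi]).
    """
    centers = np.asarray(centers, dtype=float)
    dx = centers[None, :, 0] - centers[:, None, 0]  # x_j - x_i
    dy = centers[None, :, 1] - centers[:, None, 1]  # y_j - y_i
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)
    return r, theta

def polar_attention_bias(r, theta, dist_bins, num_angle_bins, bias_table):
    """Map each (distance, angle) pair to a scalar bias to be added to
    the attention logits. bias_table is a hypothetical learned table of
    shape (len(dist_bins) + 1, num_angle_bins); the binning scheme here
    is illustrative, not the paper's actual parameterization.
    """
    d_idx = np.digitize(r, dist_bins)  # distance bucket, 0..len(dist_bins)
    # Uniform angle buckets over the full circle.
    a_idx = ((theta + np.pi) / (2 * np.pi) * num_angle_bins).astype(int)
    a_idx = a_idx % num_angle_bins
    return bias_table[d_idx, a_idx]

# Usage: three blocks; block 1 is at distance 5 from block 0,
# block 2 is directly "above" block 0 (dy = 1, dx = 0).
r, theta = relative_polar_features([[0, 0], [3, 4], [0, 1]])
dist_bins = np.array([1.0, 2.0, 5.0])
bias_table = np.zeros((len(dist_bins) + 1, 8))  # placeholder for learned weights
bias = polar_attention_bias(r, theta, dist_bins, 8, bias_table)
```

Because the features are differences between positions, shifting every block by the same offset leaves r and theta unchanged, which is the translation invariance that a global Cartesian embedding lacks.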