🤖 AI Summary
To address degraded visual features and the difficulty of jointly modeling vision and language under self-supervision in low-quality scene text recognition (STR), this paper proposes a Linguistics-aware Masked Image Modeling (LMIM) framework. LMIM channels linguistic information into the masked image modeling (MIM) decoding process through a separate branch: a linguistics alignment module extracts vision-independent features from inputs with different visual appearances and uses them as linguistic guidance, so that reconstruction must draw on global linguistic context rather than only local visual structures. This overcomes the tendency of conventional MIM methods to model purely local visual patterns. Trained fully self-supervised, the framework jointly captures character morphology and linguistic semantics, and achieves state-of-the-art performance across multiple STR benchmarks. Attention visualizations further show that the model simultaneously attends to fine-grained character structures and high-level linguistic context, validating its multimodal integration.
📝 Abstract
Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component covers structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which operates without annotations, struggles to disentangle linguistic features tied to the global context: sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, yielding limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module that extracts vision-independent features as linguistic guidance, using inputs with different visual appearances. Because this guidance extends beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.
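The core mechanism the abstract describes (mask image patches, extract vision-independent guidance from two differently-rendered views of the same text, and inject that guidance into the MIM decoder so reconstruction must use global context) can be sketched in a toy NumPy form. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear encoder/decoder, the averaging-based alignment, and all names and shapes here are hypothetical stand-ins for the actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(patches, W_enc):
    # stand-in encoder: per-patch linear projection (hypothetical)
    return patches @ W_enc

def linguistics_alignment(feat_a, feat_b):
    # vision-independent guidance: the component shared by two views of the
    # same text with different visual appearances (toy version: mean pooling)
    return 0.5 * (feat_a.mean(axis=0) + feat_b.mean(axis=0))

def lmim_reconstruct(patches, mask, W_enc, W_dec, guidance):
    feats = encode(patches, W_enc)
    feats[mask] = 0.0                 # drop features of masked patches
    ctx = feats + guidance            # inject global linguistic guidance
    return ctx @ W_dec                # decode all patches back to pixel space

# toy setup: 8 patches of dim 16, feature dim 32 (all sizes illustrative)
P, D, F = 8, 16, 32
patches = rng.normal(size=(P, D))
view_a = patches + 0.1 * rng.normal(size=(P, D))   # two augmented views
view_b = patches + 0.1 * rng.normal(size=(P, D))
W_enc = 0.1 * rng.normal(size=(D, F))
W_dec = 0.1 * rng.normal(size=(F, D))
mask = rng.random(P) < 0.75          # high mask ratio, typical of MIM

g = linguistics_alignment(encode(view_a, W_enc), encode(view_b, W_enc))
recon = lmim_reconstruct(patches, mask, W_enc, W_dec, g)
loss = np.mean((recon[mask] - patches[mask]) ** 2)  # loss on masked patches
print(recon.shape)
```

The key design point the sketch mirrors: because the guidance `g` is pooled across patches and shared between views, it carries no patch-local visual detail, so the decoder can only benefit from it by exploiting context that spans the whole sequence.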