Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Handwritten mathematical expression recognition (HMER) faces challenges including multi-scale symbols, complex spatial structures, and scarce annotated data. To address these, we propose a self-supervised attention network that employs a progressive spatial masking strategy, guiding the model to attend to semantically critical regions, such as operators, superscripts, and nested subexpressions, without human supervision. The image encoder is pre-trained with a joint global-local contrastive loss, improving structural awareness and robustness to occluded or missing components. A Transformer-based decoder is then fine-tuned to generate LaTeX sequences end-to-end. Evaluated on the CROHME benchmarks, the method outperforms existing self-supervised and fully supervised approaches while greatly reducing reliance on labeled data. Ablation studies confirm that the progressive masking mechanism and attention modeling improve structural understanding and resilience to incomplete input.
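The progressive spatial masking curriculum can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the linear ramp schedule, the patch size, and the function names are all assumptions.

```python
# Illustrative sketch of a progressive patch-masking curriculum (hypothetical
# schedule and parameters; the paper does not specify these details).
import numpy as np

def mask_ratio(step, total_steps, start=0.1, end=0.5):
    """Linearly increase the fraction of masked patches over training."""
    t = min(step / max(total_steps, 1), 1.0)
    return start + t * (end - start)

def mask_patches(image, patch=16, ratio=0.3, rng=None):
    """Zero out a random subset of non-overlapping square patches."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape
    gh, gw = h // patch, w // patch
    n_patches = gh * gw
    k = int(round(ratio * n_patches))
    idx = rng.choice(n_patches, size=k, replace=False)
    out = image.copy()
    for i in idx:
        r, c = divmod(i, gw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return out

img = np.ones((64, 64))
masked = mask_patches(img, patch=16, ratio=0.25)
```

Starting with a small mask ratio and ramping it up forces the encoder to first learn from mostly intact expressions, then cope with increasingly large missing regions, which is the intuition behind the robustness claim.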

📝 Abstract
Recognizing handwritten mathematical expressions (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive loss, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, which is trained using a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.
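The joint global-local contrastive objective can be approximated with a standard InfoNCE loss applied at two granularities: once on image-level (global) embeddings and once on patch-level (local) embeddings. The weighting, temperature, and function names below are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of a joint global-local InfoNCE objective
# (temperature and weighting are illustrative assumptions).
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor's positive is the same-index row in `positives`;
    all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def joint_loss(global_a, global_b, local_a, local_b, w_local=0.5):
    """Weighted sum of an image-level and a patch-level contrastive term."""
    return info_nce(global_a, global_b) + w_local * info_nce(local_a, local_b)
```

The global term pulls together two augmented views of the same whole expression, while the local term does the same for corresponding patches, which is one plausible way to obtain the "holistic and fine-grained" representations the abstract describes.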
Problem

Research questions and friction points this paper is trying to address.

Recognizing handwritten math expressions without labeled data
Learning self-supervised attention for mathematical symbols
Improving robustness to missing visual information in HMER
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning with progressive masking
Contrastive loss for holistic and fine-grained representations
Unsupervised attention for semantic focus regions