TextMamba: Scene Text Detector with Mamba

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the issue in scene text detection where Transformer encoders often lose critical information and attend to irrelevant representations when modeling long-range dependencies, this paper proposes a novel detection framework integrating the Mamba state-space model with selective attention. Methodologically, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to jointly improve long-range modeling capability and multi-scale feature fusion; additionally, a Top-k sparse selection mechanism is introduced to suppress interference and enable efficient global perception. The core contribution lies in synergistically combining Mamba’s selective state modeling with attention mechanisms—complementing their respective strengths—to overcome the representational bottleneck of pure Transformers on long sequences. Our method achieves state-of-the-art or highly competitive F-measures of 89.7% on CTW1500, 89.2% on TotalText, and 78.5% on ICDAR2019-ArT.

📝 Abstract
In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network-based methods. However, most directly rely on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or focusing on irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated better long-range dependency modeling through a linear-complexity selection mechanism. Therefore, we propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt the Top_k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19ArT, respectively. Code will be available.
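The Top_k selection idea from the abstract can be illustrated with a minimal sketch of top-k sparse attention: for each query, only the top_k highest-scoring keys survive the softmax, suppressing interference from irrelevant positions. This is an illustrative reconstruction, not the paper's code; all function and variable names are assumptions.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    # Illustrative sketch (not the paper's exact formulation):
    # per query, keep only the top_k highest attention scores and
    # mask out the rest, so aggregation ignores low-relevance keys.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (Lq, Lk) similarities
    kth = np.sort(scores, axis=-1)[:, [-top_k]]        # k-th largest per row
    scores = np.where(scores < kth, -np.inf, scores)   # mask low scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(scores)                                 # masked entries -> 0
    w /= w.sum(axis=-1, keepdims=True)                 # sparse softmax
    return w @ v, w

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 queries
k = rng.normal(size=(6, 8))   # 6 keys
v = rng.normal(size=(6, 8))
out, w = topk_sparse_attention(q, k, v, top_k=2)
```

Each query row of `w` ends up with exactly `top_k` nonzero weights that still sum to one, so the output is a convex combination of only the selected values.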
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of Transformer-based scene text detection in long-range dependency modeling
Proposes a Mamba-based detector integrating selection mechanisms to enhance relevant information extraction
Introduces modules for multi-scale feature fusion to improve text detection accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Mamba's selection mechanism with attention layers
Uses Top_k algorithm to select key information and reduce interference
Designs dual-scale feed-forward network and embedding pyramid enhancement module
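For readers unfamiliar with the selection mechanism the first bullet refers to, here is a minimal sketch of an input-dependent (selective) state-space scan, assuming a diagonal state matrix. Symbol names (`delta`, `A`, `B`, `C`) follow common Mamba conventions, but this is a simplified reconstruction, not the authors' implementation.

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    # Sketch of a Mamba-style selective scan with a diagonal state matrix.
    # x:     (L, D) input sequence
    # delta: (L, D) input-dependent step sizes (the "selection")
    # A:     (D, N) diagonal state dynamics per channel (N hidden states)
    # B, C:  (L, N) input-dependent in/out projections
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                               # hidden state
    ys = np.empty((L, D))
    for t in range(L):
        dA = np.exp(delta[t][:, None] * A)             # discretized decay
        dB = delta[t][:, None] * B[t][None, :]         # discretized input gain
        h = dA * h + dB * x[t][:, None]                # recurrent update
        ys[t] = h @ C[t]                               # readout
    return ys

rng = np.random.default_rng(1)
L, D, N = 10, 4, 3
x = rng.normal(size=(L, D))
delta = 0.1 * np.abs(rng.normal(size=(L, D)))          # positive step sizes
A = -np.abs(rng.normal(size=(D, N)))                   # negative -> stable decay
B = rng.normal(size=(L, N))
C = rng.normal(size=(L, N))
y = selective_scan(x, delta, A, B, C)
```

Because `delta`, `B`, and `C` vary with the input at each step, the recurrence can effectively gate what is remembered or forgotten, which is the "selection" that the detector combines with attention layers.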