🤖 AI Summary
To address the issue in scene text detection where Transformer encoders often lose critical information and attend to irrelevant representations when modeling long-range dependencies, this paper proposes a novel detection framework integrating the Mamba state-space model with selective attention. Methodologically, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to jointly improve long-range modeling capability and multi-scale feature fusion; additionally, a Top-k sparse selection mechanism is introduced to suppress interference and enable efficient global perception. The core contribution lies in synergistically combining Mamba’s selective state modeling with attention mechanisms—complementing their respective strengths—to overcome the representational bottleneck of pure Transformers on long sequences. Our method achieves state-of-the-art or highly competitive F-measures of 89.7% on CTW1500, 89.2% on TotalText, and 78.5% on ICDAR2019-ArT.
📝 Abstract
In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network (CNN)-based methods. However, most methods directly rely on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or attending to irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated superior long-range dependency modeling through a linear-complexity selection mechanism. Therefore, we propose a novel Mamba-based scene text detector that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt a Top-k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR2019-ArT, respectively. Code will be made available.
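The abstract's Top-k selection idea can be illustrated with a generic sketch: given raw attention logits, keep only the k largest scores per query and mask the rest before the softmax, so irrelevant positions receive exactly zero weight. This is a minimal, hypothetical illustration of Top-k sparsification in general, not the paper's actual module (the function name and shapes here are assumptions):

```python
import numpy as np

def topk_sparse_attention(scores, k):
    """Keep the k largest attention logits per query; zero out the rest.

    scores: (num_queries, num_keys) raw attention logits.
    A generic Top-k sparsification sketch, not the paper's exact design.
    """
    # Indices of the k largest logits along the key axis
    topk_idx = np.argpartition(scores, -k, axis=-1)[:, -k:]
    # Start from -inf everywhere, then restore only the top-k logits
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(
        masked, topk_idx,
        np.take_along_axis(scores, topk_idx, axis=-1),
        axis=-1,
    )
    # Softmax over masked logits: exp(-inf) = 0, so non-top-k keys get no weight
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = topk_sparse_attention(np.array([[1.0, 5.0, 3.0, 0.5]]), k=2)
# Only the two largest logits (5.0 and 3.0) receive nonzero weight
```

In the paper's setting, a mechanism of this kind would serve to suppress irrelevant representations before they enter the Mamba/attention encoder, rather than letting soft attention spread mass over every position.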