TextMamba: Scene Text Detector with Mamba
By: Qiyan Zhao, Yue Yan, Da-Han Wang
Potential Business Impact:
Helps computers find words in messy pictures.
In scene text detection, Transformer-based methods have addressed the global feature extraction limitations inherent in traditional convolutional neural network-based methods. However, most rely directly on native Transformer attention layers as encoders without evaluating their cross-domain limitations and inherent shortcomings: forgetting important information or attending to irrelevant representations when modeling long-range dependencies for text detection. The recently proposed state space model Mamba has demonstrated stronger long-range dependency modeling through a linear-complexity selection mechanism. We therefore propose a novel scene text detector based on Mamba that integrates the selection mechanism with attention layers, enhancing the encoder's ability to extract relevant information from long sequences. We adopt the Top_k algorithm to explicitly select key information and reduce the interference of irrelevant information in Mamba modeling. Additionally, we design a dual-scale feed-forward network and an embedding pyramid enhancement module to facilitate high-dimensional hidden-state interactions and multi-scale feature fusion. Our method achieves state-of-the-art or competitive performance on various benchmarks, with F-measures of 89.7%, 89.2%, and 78.5% on CTW1500, TotalText, and ICDAR19ArT, respectively. Code will be available.
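The Top_k idea in the abstract can be illustrated with a minimal sketch: keep only the k highest attention scores per query and suppress the rest before normalization, so irrelevant positions receive near-zero weight. The exact formulation used in TextMamba is not given here, so the function below (`topk_select`) and its masking strategy are assumptions for illustration only.

```python
import numpy as np

def topk_select(scores, k):
    """Keep the k largest attention scores per row; mask out the rest.

    Hypothetical sketch of explicit top-k selection: low-scoring
    (likely irrelevant) positions are set to -inf so that, after the
    softmax, they contribute ~0 weight.
    """
    # Indices of the k largest entries in each row.
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    # Start from -inf everywhere, then restore only the top-k scores.
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, idx, np.take_along_axis(scores, idx, axis=-1), axis=-1)
    # Softmax over the surviving entries; masked positions get zero weight.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[3.0, 0.1, 2.0, -1.0]])
weights = topk_select(scores, k=2)
# Only the two highest-scoring positions (indices 0 and 2) receive weight.
```

With k equal to the sequence length this reduces to an ordinary softmax; smaller k trades coverage for a sharper focus on the selected key positions.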
Similar Papers
PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery
CV and Pattern Recognition
Maps roads better using faster, smarter computer vision.
HybridMamba: A Dual-domain Mamba for 3D Medical Image Segmentation
CV and Pattern Recognition
Helps doctors see inside bodies better.
VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation
CV and Pattern Recognition
Helps computers see details and the big picture.