Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition
By: Yifei Zhang , Chang Liu , Jin Wei and more
Potential Business Impact:
Helps computers read blurry text by using words.
Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.
Similar Papers
Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies
CV and Pattern Recognition
Helps computers read messy text from real photos.
MIMRS: A Survey on Masked Image Modeling in Remote Sensing
CV and Pattern Recognition
Fixes cloudy satellite pictures for better maps.
Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection
CV and Pattern Recognition
Helps computers understand pictures better by focusing on important parts.