Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

Published: March 24, 2025 | arXiv ID: 2503.18746v1

By: Yifei Zhang, Chang Liu, Jin Wei and more

Potential Business Impact:

Helps computers read blurry text by using words.

Business Areas:

Image Recognition, Data and Analytics, Software

Text images are unique in their dual nature, encompassing both visual and linguistic information. The visual component encompasses structural and appearance-based features, while the linguistic dimension incorporates contextual and semantic elements. In scenarios with degraded visual quality, linguistic patterns serve as crucial supplements for comprehension, highlighting the necessity of integrating both aspects for robust scene text recognition (STR). Contemporary STR approaches often use language models or semantic reasoning modules to capture linguistic features, typically requiring large-scale annotated datasets. Self-supervised learning, which lacks annotations, presents challenges in disentangling linguistic features related to the global context. Typically, sequence contrastive learning emphasizes the alignment of local features, while masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge. In this paper, we propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch. Specifically, we design a linguistics alignment module to extract vision-independent features as linguistic guidance using inputs with different visual appearances. As features extend beyond mere visual structures, LMIM must consider the global context to achieve reconstruction. Extensive experiments on various benchmarks quantitatively demonstrate our state-of-the-art performance, and attention visualizations qualitatively show the simultaneous capture of both visual and linguistic information.
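
For readers unfamiliar with masked image modeling, the generic recipe the abstract builds on can be sketched as follows: hide a random subset of image patches, then score the reconstruction only on the hidden ones. This is a minimal illustration under stated assumptions, not the LMIM implementation. The function names, the 75% mask ratio, and the plain mean-squared-error loss are hypothetical choices made for the example, and the linguistics alignment branch described above is not modeled here.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return a list of booleans, True where a patch is hidden from the encoder.

    Hypothetical helper: the mask ratio and sampling scheme are assumptions,
    not values taken from the paper.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_idx for i in range(num_patches)]


def masked_reconstruction_loss(target, prediction, mask):
    """Mean squared error computed over masked patches only.

    Each patch is a list of floats; visible patches do not contribute.
    """
    total, count = 0.0, 0
    for t_patch, p_patch, hidden in zip(target, prediction, mask):
        if not hidden:
            continue
        for t, p in zip(t_patch, p_patch):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


# Usage: mask 75% of 16 patches, then score a toy two-patch reconstruction.
mask = random_patch_mask(16, 0.75, seed=0)
toy_loss = masked_reconstruction_loss(
    target=[[1.0, 2.0], [3.0, 4.0]],
    prediction=[[0.0, 0.0], [3.0, 5.0]],
    mask=[False, True],
)
```

Because the loss ignores visible patches, a model trained this way can often succeed by copying nearby strokes and textures, which is the local-structure shortcut the abstract says limits linguistic knowledge.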

Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

CV and Pattern Recognition

Helps computers read messy text from real photos.

11 May 2025 2

89%

MIMRS: A Survey on Masked Image Modeling in Remote Sensing

CV and Pattern Recognition

Fixes cloudy satellite pictures for better maps.

4 Apr 2025 0

88%

Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection

CV and Pattern Recognition

Helps computers understand pictures better by focusing on important parts.

14 Mar 2025 0

Repos / Data Links

github.com

Page Count

11 pages
