Score: 2

Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

Published: May 11, 2025 | arXiv ID: 2505.06855v1

By: Zhengmi Tang , Yuto Mitsui , Tomo Miyazaki and more

Potential Business Impact:

Helps computers read messy text from real photos.

Business Areas:

Text Analytics Data and Analytics, Software

Most existing text recognition methods are trained on large-scale synthetic datasets due to the scarcity of labeled real-world datasets. Synthetic images, however, cannot faithfully reproduce real-world scenarios, such as uneven illumination, irregular layout, occlusion, and degradation, resulting in performance disparities when handling complex real-world images. Recent self-supervised learning techniques, notably contrastive learning and masked image modeling (MIM), narrow this domain gap by exploiting unlabeled real text images. This study first analyzes the original Masked AutoEncoder (MAE) and observes that random patch masking predominantly captures low-level textural features but misses high-level contextual representations. To fully exploit the high-level contextual representations, we introduce random blockwise and span masking in the text recognition task. These strategies can mask the continuous image patches and completely remove some characters, forcing the model to infer relationships among characters within a word. Our Multi-Masking Strategy (MMS) integrates random patch, blockwise, and span masking into the MIM frame, which jointly learns low and high-level textual representations. After fine-tuning with real data, MMS outperforms the state-of-the-art self-supervised methods in various text-related tasks, including text recognition, segmentation, and text-image super-resolution.

Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

CV and Pattern Recognition

Helps computers read blurry text by using words.

24 Mar 2025 1

89%

Evolved Hierarchical Masking for Self-Supervised Learning

CV and Pattern Recognition

Teaches computers to see details better.

12 Apr 2025 0

88%

CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework

CV and Pattern Recognition

Teaches computers to see faster and better.

8 Nov 2025 1

View PDF Login to Bookmark

Country of Origin

🇯🇵 Japan

Repos / Data Links

github.com

Page Count

14 pages

Joint Low-level and High-level Textual Representation Learning with Multiple Masking Strategies

Helps computers read messy text from real photos.

Technical Abstract

Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition

Evolved Hierarchical Masking for Self-Supervised Learning

CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework