CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework
By: Jiaxuan Li, Qing Xu, Xiangjian He, and more
Potential Business Impact:
Teaches computers to see faster and better.
Masked Autoencoders (MAE) achieve self-supervised learning of image representations by randomly removing a portion of visual tokens and reconstructing the original image as a pretext task, thereby significantly enhancing pre-training efficiency and yielding excellent adaptability across downstream tasks. However, MAE and other MAE-style paradigms that adopt random masking generally require more pre-training epochs to achieve comparable adaptability. Meanwhile, the ViT backbone in MAE uses parameters inefficiently because it keeps a fixed spatial resolution across all layers. To overcome these limitations, we propose the Complementary Masked Autoencoders (CoMA), which employ a complementary masking strategy that samples every pixel uniformly, promoting effective learning of all features and enhancing the model's adaptability. Furthermore, we introduce DyViT, a hierarchical vision transformer that employs Dynamic Multi-Window Self-Attention (DM-MSA), significantly reducing parameter count and FLOPs while improving fine-grained feature learning. Pre-trained on ImageNet-1K with CoMA, DyViT matches the downstream performance of MAE using only 12% of the pre-training epochs, demonstrating more effective learning. It also cuts per-epoch pre-training time by 10%, further underscoring its superior pre-training efficiency.
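To make the complementary masking idea concrete, here is a minimal sketch in PyTorch. It assumes one plausible reading of the strategy: with a mask ratio of 0.75, four masks each keep a disjoint 25% of patches visible, so across the set every patch is sampled exactly once. The function name, the view-partitioning scheme, and the choice of a shared random permutation are illustrative assumptions, not CoMA's published implementation, whose exact scheme may differ.

```python
import torch

def complementary_masks(num_patches: int, mask_ratio: float = 0.75) -> torch.Tensor:
    """Hypothetical sketch of complementary masking (not CoMA's exact code).

    Builds a set of masks whose visible patches are disjoint and whose union
    covers all patches, so every patch is reconstructed uniformly often.
    Returns a (num_views, num_patches) bool tensor where True = masked.
    """
    keep = 1.0 - mask_ratio
    num_views = round(1.0 / keep)  # e.g. 4 complementary views for 75% masking
    assert abs(num_views * keep - 1.0) < 1e-6, "1/(1 - mask_ratio) must be an integer"
    assert num_patches % num_views == 0, "patches must split evenly across views"

    perm = torch.randperm(num_patches)   # one shared shuffle partitions the patches
    chunk = num_patches // num_views
    masks = torch.ones(num_views, num_patches, dtype=torch.bool)
    for v in range(num_views):
        visible = perm[v * chunk:(v + 1) * chunk]
        masks[v, visible] = False        # each view unmasks its own disjoint chunk
    return masks

# Usage: a 14x14 ViT patch grid; each patch is visible in exactly one view.
masks = complementary_masks(num_patches=196, mask_ratio=0.75)
assert (~masks).sum(dim=0).eq(1).all()
```

Under these assumptions, the design choice is that random masks are drawn once per set rather than independently per view, which is what guarantees uniform pixel coverage instead of leaving some patches unseen for many epochs.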
Similar Papers
MuM: Multi-View Masked Image Modeling for 3D Vision
CV and Pattern Recognition
Teaches computers to understand 3D from many pictures.
Structure is Supervision: Multiview Masked Autoencoders for Radiology
CV and Pattern Recognition
Helps doctors find diseases in X-rays better.
Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
CV and Pattern Recognition
Helps computers understand people talking and acting.