Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
By: Hugo Bohy, Minh Tran, Kevin El Haddad, and more
Potential Business Impact:
Helps computers read emotions and social signals from people's faces and voices.
Human social behaviors are inherently multimodal, necessitating powerful audiovisual models for their perception. In this paper, we present Social-MAE, a pre-trained audiovisual Masked Autoencoder based on an extended version of the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) and pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to accept a larger number of input frames and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by finetuning and evaluating it on several social and affective downstream tasks, namely emotion recognition, laughter detection, and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition, and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weights are available at https://github.com/HuBohy/SocialMAE.
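The workflow described in the abstract (take an in-domain pre-trained audiovisual encoder, attach a task-specific head, and finetune on a labeled downstream dataset such as an emotion-recognition corpus) can be illustrated with a short PyTorch sketch. The names below (SocialMAE, from_pretrained, train_loader, the 768-dimensional embedding, and the 8 emotion classes) are illustrative assumptions, not the released API; refer to the linked repository for the actual code.

# Minimal finetuning sketch (PyTorch). Assumes a hypothetical SocialMAE encoder
# and a dataloader yielding (audio_spectrogram, video_frames, label) batches.
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, encoder, embed_dim=768, num_classes=8):
        super().__init__()
        self.encoder = encoder                 # pre-trained audiovisual MAE encoder
        self.head = nn.Linear(embed_dim, num_classes)  # downstream task head

    def forward(self, audio, video):
        # The encoder is assumed to return one joint audiovisual embedding per clip.
        features = self.encoder(audio, video)  # (batch, embed_dim)
        return self.head(features)

# Hypothetical usage:
# encoder = SocialMAE.from_pretrained("social-mae-voxceleb2")   # assumed loader
# model = EmotionClassifier(encoder)
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# criterion = nn.CrossEntropyLoss()
# for audio, video, labels in train_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(audio, video), labels)
#     loss.backward()
#     optimizer.step()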
Similar Papers
MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder
CV and Pattern Recognition
Fixes missing brain scans for better medical pictures.
SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation
Artificial Intelligence
Makes talking-face videos match their sound.
CoMA: Complementary Masking and Hierarchical Dynamic Multi-Window Self-Attention in a Unified Pre-training Framework
CV and Pattern Recognition
Teaches computers to see faster and better.