GPTFace: Generative Pre-training of Facial-Linguistic Transformer by Span Masking and Weakly Correlated Text-image Data
By: Yudong Li , Hao Li , Xianxu Hou and more
Potential Business Impact:
Teaches computers to understand and change faces.
Compared to the prosperity of pre-training models in natural image understanding, the research on large-scale pre-training models for facial knowledge learning is still limited. Current approaches mainly rely on manually assembled and annotated face datasets for training, but labeling such datasets is labor-intensive and the trained models have limited scalability beyond the training data. To address these limitations, we present a generative pre-training model for facial knowledge learning that leverages large-scale web-built data for training. We use texts and images containing human faces crawled from the internet and conduct pre-training on self-supervised tasks, including masked image/language modeling (MILM) and image-text matching (ITM). During the generation stage, we further utilize the image-text matching loss to pull the generation distribution towards the control signal for controllable image/text generation. Experimental results demonstrate that our model achieves comparable performance to state-of-the-art pre-training models for various facial downstream tasks, such as attribution classification and expression recognition. Furthermore, our approach is also applicable to a wide range of face editing tasks, including face attribute editing, expression manipulation, mask removal, and photo inpainting.
Similar Papers
Learning from Gene Names, Expression Values and Images: Contrastive Masked Text-Image Pretraining for Spatial Transcriptomics Representation Learning
CV and Pattern Recognition
Helps scientists see which genes are active in cells.
Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval
CV and Pattern Recognition
Helps computers recognize people in photos better.
Multivariate Diffusion Transformer with Decoupled Attention for High-Fidelity Mask-Text Collaborative Facial Generation
CV and Pattern Recognition
Creates realistic faces from masks and words.