Taming Transformer for Emotion-Controllable Talking Face Generation
By: Ziqi Zhang, Cheng Deng
Potential Business Impact:
Generates videos of people talking, with the emotion of the face controlled by the input audio.
Talking face generation is a novel and challenging task that aims to synthesize a vivid speaking-face video given a specific audio clip. To achieve emotion-controllable talking face generation, current methods must overcome two challenges: one is how to effectively model the multimodal relationship tied to a specific emotion, and the other is how to leverage this relationship to synthesize identity-preserving emotional videos. In this paper, we propose a novel method that tackles the emotion-controllable talking face generation task in a discrete fashion. Specifically, we employ two pre-training strategies to disentangle audio into independent components and quantize videos into combinations of visual tokens. Subsequently, we propose the emotion-anchor (EA) representation, which integrates emotional information into the visual tokens. Finally, we introduce an autoregressive transformer to model the global distribution of the visual tokens under the given conditions and to predict the index sequence used to synthesize the manipulated videos. We conduct experiments on the MEAD dataset, controlling the emotion of the synthesized videos conditioned on multiple emotional audio clips. Extensive experiments demonstrate the superiority of our method both qualitatively and quantitatively.
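The generative step described in the abstract amounts to conditional next-token prediction over quantized visual tokens. Below is a minimal, hypothetical PyTorch sketch of that idea: an autoregressive transformer predicting visual-token indices conditioned on audio and emotion embeddings. The class name ConditionalTokenTransformer, the prefix-conditioning scheme, and all dimensions are illustrative assumptions, not the paper's actual architecture.

    # Minimal sketch: autoregressive prediction of discrete visual-token indices
    # conditioned on audio and emotion embeddings (assumed shapes and names).
    import torch
    import torch.nn as nn

    class ConditionalTokenTransformer(nn.Module):
        def __init__(self, vocab_size=1024, d_model=256, n_heads=4, n_layers=4, max_len=256):
            super().__init__()
            self.token_emb = nn.Embedding(vocab_size, d_model)   # codebook indices -> embeddings
            self.pos_emb = nn.Embedding(max_len, d_model)        # learned positional embeddings
            self.cond_proj = nn.Linear(2 * d_model, d_model)     # fuse audio + emotion conditions
            layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=4 * d_model, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, vocab_size)            # logits over visual-token vocabulary

        def forward(self, tokens, audio_feat, emotion_feat):
            # tokens: (B, T) indices of quantized frames; audio/emotion features: (B, d_model)
            B, T = tokens.shape
            pos = torch.arange(T, device=tokens.device)
            x = self.token_emb(tokens) + self.pos_emb(pos)[None]
            cond = self.cond_proj(torch.cat([audio_feat, emotion_feat], dim=-1))
            x = torch.cat([cond[:, None, :], x], dim=1)           # prepend condition as a prefix token
            # causal mask: each position attends only to earlier positions (autoregressive factorization)
            mask = torch.triu(torch.ones(T + 1, T + 1, device=tokens.device), diagonal=1).bool()
            h = self.backbone(x, mask=mask)
            return self.head(h[:, :-1])                           # next-index logits at every position

    # Toy usage: next-token cross-entropy on random data; at inference, indices would be
    # sampled step by step and passed to a (frozen) VQ decoder to reconstruct frames.
    model = ConditionalTokenTransformer()
    tokens = torch.randint(0, 1024, (2, 16))
    audio, emotion = torch.randn(2, 256), torch.randn(2, 256)
    logits = model(tokens, audio, emotion)                        # (2, 16, 1024)
    loss = nn.functional.cross_entropy(logits.reshape(-1, 1024), tokens.reshape(-1))

In the paper's setting, the plain emotion embedding used here would presumably be replaced by the proposed emotion-anchor (EA) representation, and the predicted index sequence decoded back into video frames by the pre-trained quantizer.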
Similar Papers
RealTalk: Realistic Emotion-Aware Lifelike Talking-Head Synthesis
CV and Pattern Recognition
Makes computer-generated talking heads show realistic emotions driven by voice.
EAI-Avatar: Emotion-Aware Interactive Talking Head Generation
Audio and Speech Processing
Makes interactive talking-head avatars show emotions.
Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation
CV and Pattern Recognition
Makes talking-portrait videos show emotion while preserving the speaker's identity.