MoDA: Multi-modal Diffusion Architecture for Talking Head Generation
By: Xinyang Li, Gen Li, Zhihui Lin, and more
Potential Business Impact:
Makes talking avatars look and move realistically.
Talking head generation with arbitrary identities and speech audio remains a crucial problem in the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field owing to their strong generation capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAEs), which complicates the diffusion process; 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion. MoDA addresses these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture that models the interaction among noisy motion, audio, and auxiliary conditions, enhancing overall facial expressiveness. In addition, a coarse-to-fine fusion strategy progressively integrates the different modalities, ensuring effective feature fusion. Experimental results demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications. Project Page: https://lixinyyang.github.io/MoDA.github.io/
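To make the flow-matching idea in the abstract concrete, below is a minimal sketch of a conditional flow-matching training step for audio-conditioned motion parameters. The network, dimensions, and variable names (MotionVelocityNet, motion_dim, audio_dim) are illustrative assumptions, not the authors' implementation; the objective shown is the standard rectified-flow regression of a straight-line velocity.

```python
# A minimal, hypothetical sketch of conditional flow matching for motion-parameter
# generation; module names and dimensions are illustrative, not MoDA's actual code.
import torch
import torch.nn as nn

class MotionVelocityNet(nn.Module):
    """Toy velocity field conditioned on audio features (hypothetical architecture)."""
    def __init__(self, motion_dim=64, audio_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + audio_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, audio):
        # Concatenate noisy motion, timestep, and the audio condition.
        return self.net(torch.cat([x_t, t, audio], dim=-1))

def flow_matching_loss(model, x1, audio):
    """Rectified-flow style objective: regress the straight-line velocity x1 - x0."""
    x0 = torch.randn_like(x1)        # noise sample
    t = torch.rand(x1.size(0), 1)    # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # linear interpolation between noise and data
    target_v = x1 - x0               # constant target velocity along the path
    pred_v = model(x_t, t, audio)
    return ((pred_v - target_v) ** 2).mean()

# Usage: one training step on random stand-in data.
model = MotionVelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
motion = torch.randn(8, 64)    # stand-in facial-motion parameters
audio = torch.randn(8, 128)    # stand-in audio features
loss = flow_matching_loss(model, motion, audio)
loss.backward()
opt.step()
```

The straight-line interpolation path is what makes flow matching simpler to train and faster to sample than a full diffusion schedule, which is the efficiency argument the abstract makes for operating in a joint motion-parameter space rather than a VAE latent space.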
Similar Papers
MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation
CV and Pattern Recognition
Makes cartoon mouths move like real people.
Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion
CV and Pattern Recognition
Makes computer animations move like real people.