M3G: Multi-Granular Gesture Generator for Audio-Driven Full-Body Human Motion Synthesis
By: Zhizhuo Yin, Yuk Hang Tsui, Pan Hui
Potential Business Impact:
Makes avatars move realistically from sound.
Generating full-body human gestures encompassing face, body, hands, and global movements from audio is a valuable yet challenging task in virtual avatar creation. Previous systems focus on tokenizing human gestures frame by frame and predicting the token for each frame from the input audio. However, the number of frames required for a complete, expressive human gesture, which we define as its granularity, varies across gesture patterns. Existing systems fail to model these patterns because their gesture tokens have a fixed granularity. To solve this problem, we propose a novel framework named Multi-Granular Gesture Generator (M3G) for audio-driven holistic gesture generation. In M3G, we propose a novel Multi-Granular VQ-VAE (MGVQ-VAE) that tokenizes motion patterns and reconstructs motion sequences at different temporal granularities. We further propose a multi-granular token predictor that extracts multi-granular information from audio and predicts the corresponding motion tokens. M3G then reconstructs the human gestures from the predicted tokens using the MGVQ-VAE. Both objective and subjective experiments demonstrate that our proposed M3G framework outperforms state-of-the-art methods in generating natural and expressive full-body human gestures.
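To make the multi-granular tokenization idea concrete, the sketch below shows one plausible way to tokenize a motion sequence at several temporal granularities and reconstruct it by fusing the decoded streams, in the spirit of the MGVQ-VAE described above. All module names, layer sizes, and the particular granularities (1, 4, and 16 frames per token) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of multi-granular motion tokenization; dimensions and
# architecture are illustrative assumptions, not the paper's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Standard VQ layer: snap each latent vector to its nearest codebook entry."""
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, T, D)
        dist = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        tokens = dist.argmin(dim=-1)                        # discrete motion tokens
        z_q = self.codebook(tokens)
        # Straight-through estimator so gradients reach the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, tokens

class MultiGranularVQVAE(nn.Module):
    """Tokenize motion at several temporal granularities (frames per token)
    and reconstruct by summing the upsampled, decoded streams."""
    def __init__(self, motion_dim=165, latent_dim=256, num_codes=512,
                 granularities=(1, 4, 16)):
        super().__init__()
        self.granularities = granularities
        self.encoder = nn.Linear(motion_dim, latent_dim)
        self.quantizers = nn.ModuleList(
            [VectorQuantizer(num_codes, latent_dim) for _ in granularities])
        self.decoder = nn.Linear(latent_dim, motion_dim)

    def encode(self, motion):                               # motion: (B, T, motion_dim)
        z = self.encoder(motion)                            # (B, T, latent_dim)
        all_tokens = []
        for g, vq in zip(self.granularities, self.quantizers):
            # Pool g consecutive frames into one latent -> one token per g frames.
            zg = F.avg_pool1d(z.transpose(1, 2), kernel_size=g, stride=g)
            _, tokens = vq(zg.transpose(1, 2))              # (B, T // g)
            all_tokens.append(tokens)
        return all_tokens

    def decode(self, all_tokens, num_frames):
        # Upsample each granularity's stream back to T frames and fuse by summing.
        z_sum = 0
        for g, vq, tokens in zip(self.granularities, self.quantizers, all_tokens):
            z_q = vq.codebook(tokens)                       # (B, T // g, D)
            z_sum = z_sum + z_q.repeat_interleave(g, dim=1)[:, :num_frames]
        return self.decoder(z_sum)                          # (B, T, motion_dim)

model = MultiGranularVQVAE()
motion = torch.randn(2, 64, 165)          # a batch of 64-frame motion clips
tokens = model.encode(motion)             # one token sequence per granularity
recon = model.decode(tokens, num_frames=64)
print([t.shape for t in tokens], recon.shape)
```

In this sketch, the coarse streams (many frames per token) would capture long, complete gesture patterns while the fine streams capture per-frame detail; an audio-conditioned predictor would then output one token sequence per granularity before decoding.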
Similar Papers
Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation
Multimedia
Makes computer gestures match sounds better.
GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation
CV and Pattern Recognition
Makes computer animations move more realistically.
MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization
Graphics
Makes computer characters move hands naturally while talking.