Score: 1

Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning

Published: December 15, 2025 | arXiv ID: 2512.13131v1

By: Xin Guo, Yifan Zhao, Jia Li

Potential Business Impact:

Makes talking characters move naturally.

Business Areas:
Motion Capture, Media and Entertainment, Video

Generating 3D body movements from speech shows great potential for a wide range of downstream applications, yet it remains challenging to imitate realistic human movements. Predominant research focuses on end-to-end schemes for generating co-speech gestures, spanning GANs, VQ-VAEs, and, more recently, diffusion models. Because the task is ill-posed, we argue in this paper that these prevailing learning schemes fail to model the crucial inter- and intra-correlations across motion units, i.e., the head, body, and hands, which leads to unnatural movements and poor coordination. To capture these intrinsic correlations, we propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-driven 3D gesture generation. Unlike prior work, our approach models this multi-modal implicit relationship through two explicit technical insights: i) to disentangle complicated gesture movements, we explore gesture motion phase manifolds with periodic autoencoders, imitating natural human motion from realistic distributions while incorporating non-periodic components from the current latent states for instance-level diversity; ii) to model the hierarchical relationship among face motions, body gestures, and hand movements, we drive the animation with cascaded guidance during learning. We demonstrate the proposed approach on 3D avatars, and extensive experiments show that our method outperforms state-of-the-art co-speech gesture generation methods in both quantitative and qualitative evaluations. Code and models will be publicly available.
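The abstract's first insight relies on a periodic autoencoder to map motion windows onto phase manifolds. Since the paper's code is not yet released, the sketch below is only a minimal illustration of that idea (in the spirit of DeepPhase-style periodic autoencoders), not the authors' implementation; the layer sizes, the FFT-based amplitude/frequency estimation, and the two-channel atan2 phase head are all assumptions for illustration.

```python
# Minimal sketch of a periodic autoencoder for motion phase extraction.
# NOT the HIP authors' code; architecture and parameters are illustrative.
import torch
import torch.nn as nn


class PeriodicAutoencoder(nn.Module):
    def __init__(self, joint_dim: int, phase_channels: int = 8, window: int = 64):
        super().__init__()
        self.window = window
        # Temporal convolutions compress a pose window into a few latent curves.
        self.encoder = nn.Sequential(
            nn.Conv1d(joint_dim, 64, kernel_size=5, padding=2), nn.ELU(),
            nn.Conv1d(64, phase_channels, kernel_size=5, padding=2),
        )
        # Each latent curve is projected to 2-D; its angle serves as the phase.
        self.phase_head = nn.Linear(window, 2)
        self.decoder = nn.Sequential(
            nn.Conv1d(phase_channels, 64, kernel_size=5, padding=2), nn.ELU(),
            nn.Conv1d(64, joint_dim, kernel_size=5, padding=2),
        )

    def forward(self, motion):  # motion: (batch, joint_dim, window)
        latent = self.encoder(motion)                          # (B, C, T)
        # Per-channel amplitude and dominant frequency via a real FFT over time.
        power = torch.fft.rfft(latent, dim=-1).abs() ** 2      # (B, C, T//2+1)
        freqs = torch.fft.rfftfreq(self.window, device=motion.device)
        amplitude = 2.0 * power[..., 1:].sum(dim=-1).sqrt() / self.window
        frequency = (freqs[1:] * power[..., 1:]).sum(dim=-1) / \
            power[..., 1:].sum(dim=-1).clamp(min=1e-8)
        # Phase from a learned 2-D embedding of each latent curve.
        xy = self.phase_head(latent)                           # (B, C, 2)
        phase = torch.atan2(xy[..., 1], xy[..., 0])            # (B, C)
        # Rebuild the window from sinusoids parameterised by (A, f, phi).
        n = torch.arange(self.window, device=motion.device, dtype=motion.dtype)
        signal = amplitude.unsqueeze(-1) * torch.sin(
            2.0 * torch.pi * frequency.unsqueeze(-1) * n + phase.unsqueeze(-1)
        )
        return self.decoder(signal), (amplitude, frequency, phase)


if __name__ == "__main__":
    model = PeriodicAutoencoder(joint_dim=69)                  # e.g. 23 joints x 3D rotation
    recon, (A, f, phi) = model(torch.randn(4, 69, 64))
    print(recon.shape, phi.shape)                              # (4, 69, 64), (4, 8)
```

Training such an autoencoder with a reconstruction loss on real gesture data would yield phase variables that a downstream generator could condition on, while the non-periodic residual of the latent state supplies instance-level diversity, as the abstract describes. The cascaded face-to-body-to-hand guidance of the second insight would then be layered on top of this representation.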

Country of Origin
🇨🇳 China

Page Count
13 pages

Category
Computer Science:
Artificial Intelligence