Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation
By: Guangtao Lyu, Chenghao Xu, Qi Liu, and more
Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this observation, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the rhythm perception of the diffusion model. TempoMoE organizes motion experts into tempo-structured groups covering different tempo ranges, with multi-scale beat experts capturing both fine-grained and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing mechanism dynamically selects and fuses experts conditioned on music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in both dance quality and rhythm alignment.
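To make the described architecture concrete, below is a minimal PyTorch sketch of a tempo-grouped Mixture-of-Experts layer with a learned soft router over music features. This is an illustrative reconstruction from the abstract alone, not the paper's implementation: the class name, group/expert counts, expert MLP shape, mean-pooled routing, and residual connection are all assumptions.

```python
# Minimal sketch of a tempo-aware MoE layer (assumptions labeled below).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TempoMoESketch(nn.Module):
    """Experts grouped by tempo range, fused by a learned soft router.

    Hypothetical stand-in for TempoMoE: tempo_groups loosely corresponds to
    buckets over the 60-200 BPM range, experts_per_group to beat experts
    at different temporal scales.
    """

    def __init__(self, dim=256, tempo_groups=4, experts_per_group=2):
        super().__init__()
        # One small MLP expert per (tempo group, beat scale) slot. The MLP
        # shape is an assumption; the paper does not specify expert form here.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(tempo_groups * experts_per_group)
        )
        # Router maps a pooled music summary to weights over all experts.
        self.router = nn.Linear(dim, tempo_groups * experts_per_group)

    def forward(self, motion_feat, music_feat):
        # motion_feat: (B, T, D) motion tokens; music_feat: (B, T, D) music tokens.
        pooled = music_feat.mean(dim=1)                    # (B, D) music summary
        weights = F.softmax(self.router(pooled), dim=-1)   # (B, G*E) expert weights
        out = torch.stack([e(motion_feat) for e in self.experts], dim=-1)  # (B, T, D, G*E)
        fused = (out * weights[:, None, None, :]).sum(dim=-1)  # weighted fusion
        return motion_feat + fused                         # residual (assumed)


# Usage: route a batch of motion tokens conditioned on music features.
if __name__ == "__main__":
    layer = TempoMoESketch(dim=256)
    motion = torch.randn(2, 120, 256)   # 2 clips, 120 motion frames
    music = torch.randn(2, 120, 256)
    print(layer(motion, music).shape)   # torch.Size([2, 120, 256])
```

The soft, fully differentiable fusion shown here is only one routing choice; a top-k sparse selection over the tempo-structured groups would be an equally plausible reading of "dynamically selects and fuses experts."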
Similar Papers
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation
Sound
Makes videos play music that matches the action.