
Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation

Published: November 12, 2025 | arXiv ID: 2511.09090v1

By: Shulei Ji, Zihao Wang, Jiaxing Yu and more

Potential Business Impact:

Makes videos play music that matches the action.

Business Areas:

Motion Capture, Media and Entertainment, Video

Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignments; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone via the first layer, while semantic and rhythmic features are fused in the second cross-attention layer. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance in terms of objective metrics and subjective comparisons. Demo and code are available at https://Tayjsl97.github.io/Diff-V2M-Demo/.
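The timestep-aware fusion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the sinusoidal timestep embedding, the scalar sigmoid gate, and the exact FiLM wiring (modulating the rhythmic stream and adding the semantic stream) are all assumptions made for illustration. The weight matrices stand in for parameters that a real model would learn.

```python
import numpy as np


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep (assumes even `dim`)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def weighted_fusion(semantic, rhythmic, t_emb, w):
    """Blend two (T, D) feature sequences with a timestep-dependent scalar gate.

    `w` has shape (E,) and is a hypothetical learned projection of the
    timestep embedding onto a single gate value.
    """
    alpha = sigmoid(t_emb @ w)  # scalar in (0, 1)
    return alpha * semantic + (1.0 - alpha) * rhythmic


def film_fusion(semantic, rhythmic, t_emb, w_gamma, w_beta):
    """FiLM-style fusion: per-channel scale and shift predicted from the timestep.

    `w_gamma` and `w_beta` have shape (E, D). The modulated rhythmic features
    are added to the semantic features; this wiring is an assumption.
    """
    gamma = t_emb @ w_gamma  # (D,) per-channel scale offset
    beta = t_emb @ w_beta    # (D,) per-channel shift
    return (1.0 + gamma) * rhythmic + beta + semantic


# Toy usage with random features: T frames, D channels, E embedding size.
rng = np.random.default_rng(0)
T, D, E = 8, 16, 32
semantic = rng.normal(size=(T, D))
rhythmic = rng.normal(size=(T, D))
t_emb = timestep_embedding(t=250, dim=E)

fused_weighted = weighted_fusion(semantic, rhythmic, t_emb, rng.normal(size=E))
fused_film = film_fusion(
    semantic, rhythmic, t_emb,
    0.1 * rng.normal(size=(E, D)),
    0.1 * rng.normal(size=(E, D)),
)
```

Because the gate and the FiLM parameters depend only on the timestep embedding, the balance between semantic and rhythmic cues can shift as denoising proceeds, which is the behavior the abstract attributes to these strategies.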

Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation

Sound

Makes videos play music that matches the action.

12 Nov 2025 0

91%

Country of Origin

🇨🇳 China

Page Count

11 pages
