Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Published: November 26, 2025 | arXiv ID: 2511.21579v1

By: Teng Hu, Zhentao Yu, Guozhen Zhang, and more

Potential Business Impact:

Makes videos match sounds perfectly.

Business Areas:

Augmented Reality Hardware, Software

The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
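
For readers unfamiliar with the conventional Classifier-Free Guidance (CFG) that the abstract contrasts with SyncCFG, the following is a minimal sketch of the standard CFG combination rule only. It is not the paper's SyncCFG, whose formulation is not given in this abstract. The function name `classifier_free_guidance` and the use of plain Python lists for the noise predictions are assumptions made for illustration.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Standard CFG (illustrative, not the paper's SyncCFG).

    Starts from the unconditional noise prediction and moves along the
    direction (conditional - unconditional), scaled by guidance_scale.
    A scale of 0 returns the unconditional prediction, and a scale of 1
    returns the conditional prediction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical toy predictions; real models produce high-dimensional tensors.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], guidance_scale=2.0)
```

The guided direction here depends only on the conditioning signal, which is consistent with the abstract's remark that conventional CFG enhances conditionality rather than cross-modal synchronization.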

UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

CV and Pattern Recognition

Makes videos match sounds perfectly.

5 Nov 2025 1

89%

Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm

CV and Pattern Recognition

Makes videos match sounds automatically and easily.

5 Aug 2025 0

88%

Harmony-Aware Music-driven Motion Synthesis with Perceptual Constraint on UGC Datasets

Multimedia

Makes dance videos match music perfectly.

8 Jun 2025 0

Page Count

21 pages
