MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

Published: December 26, 2025 | arXiv ID: 2512.22310v1

By: Run Ling, Ke Cao, Jian Lu and more

BigTech Affiliations: JD.com

Potential Business Impact:

Makes generated videos match reference pictures and text more faithfully.

Business Areas:

Motion Capture, Media and Entertainment, Video

Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. In addition, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To evaluate these challenges further, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
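The abstract does not spell out how Fourier Fusion combines the reference features, so the following is only a minimal sketch of the general idea of order-independent fusion in the frequency domain, not the authors' implementation. It assumes each reference is a 1D feature vector of equal length, and it uses a hypothetical fusion rule (mean amplitude per frequency bin, with the phase of the summed spectrum). The function names `dft`, `idft`, and `fourier_fuse` are illustrative, and a naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def fourier_fuse(ref_feats):
    """Fuse equal-length 1D reference features into one vector.

    Assumed rule (not taken from the paper): per frequency bin, average the
    amplitudes and take the phase of the summed spectrum. Both operations are
    symmetric in their inputs, so the result ignores the reference order.
    """
    spectra = [dft(f) for f in ref_feats]
    n_refs = len(spectra)
    fused = []
    for k in range(len(spectra[0])):
        bins = [s[k] for s in spectra]
        amplitude = sum(abs(b) for b in bins) / n_refs
        phase = cmath.phase(sum(bins))
        fused.append(cmath.rect(amplitude, phase))
    # Real inputs give a conjugate-symmetric fused spectrum, so the imaginary
    # part after inversion is only numerical noise and is dropped.
    return [v.real for v in idft(fused)]


refs = [[1.0, 2.0, 0.5, -1.0], [0.0, 1.5, 2.0, 0.25], [3.0, -0.5, 1.0, 1.0]]
fused_abc = fourier_fuse(refs)
fused_cab = fourier_fuse([refs[2], refs[0], refs[1]])
```

Because the mean and the sum do not depend on the order of their arguments, `fused_abc` and `fused_cab` agree up to floating-point rounding, which is the permutation-invariance property the abstract targets. With a single reference, the rule reduces to the identity and returns that reference unchanged.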

CoMo: Compositional Motion Customization for Text-to-Video Generation

CV and Pattern Recognition

Makes videos show many actions at once.

27 Oct 2025 1

87%

MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling

CV and Pattern Recognition

Makes videos of people move realistically from words.

24 Aug 2025 0

87%

OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

CV and Pattern Recognition

Makes AI draw people better, even in crowds.

9 Dec 2025 1

Country of Origin

🇨🇳 China

Page Count

12 pages
