Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
By: Siyan Chen, Yanfei Chen, Ying Chen, and others
Potential Business Impact:
Generates video with tightly synchronized audio, including accurate lip-sync.
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10×. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
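The abstract does not include implementation details, so as a rough illustration only: a "dual-branch" design with a cross-modal joint module can be sketched as two token streams (video and audio) that exchange information via cross-attention, with each branch residually attending to the other modality. All names, shapes, and dimensions below are hypothetical, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(queries, keys, values):
    # Scaled dot-product attention: one branch's tokens (queries)
    # attend to the other branch's tokens (keys/values).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d_model = 8                                     # hypothetical hidden size
video_tokens = rng.normal(size=(16, d_model))   # hypothetical video latent tokens
audio_tokens = rng.normal(size=(10, d_model))   # hypothetical audio latent tokens

# Joint module sketch: each branch queries the other modality's tokens,
# and the result is added back residually to its own stream, so the two
# branches stay aligned while keeping separate token sequences.
video_out = video_tokens + cross_modal_attention(video_tokens, audio_tokens, audio_tokens)
audio_out = audio_tokens + cross_modal_attention(audio_tokens, video_tokens, video_tokens)

print(video_out.shape, audio_out.shape)  # (16, 8) (10, 8)
```

In a full dual-branch Diffusion Transformer, blocks like this would be interleaved with each branch's own self-attention and feed-forward layers inside the denoising loop; the sketch shows only the cross-modal exchange that ties audio and video together.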
Similar Papers
Seedance 1.0: Exploring the Boundaries of Video Generation Models
CV and Pattern Recognition
Makes videos from words, faster and better.
SALSA-V: Shortcut-Augmented Long-form Synchronized Audio from Videos
Sound
Makes silent videos talk with realistic sound.
Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
CV and Pattern Recognition
Creates better pictures from Chinese and English words.