Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization
By: Chaoqun Cui, Liangbin Huang, Shijing Wang and more
Potential Business Impact:
Makes dubbed videos match the original speaking time.
Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
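To make the preference-optimization framing concrete, the following is a minimal sketch and not the authors' implementation. It assumes each source line has several candidate translations with an estimated spoken duration, forms a (chosen, rejected) pair per segment by duration mismatch, and scores a pair with the standard DPO loss. The helper names, the relative-gap metric, and the value of `beta` are hypothetical; the abstract does not specify SSPO's sampling rule or its fine-grained loss.

```python
import math

def duration_gap(source_sec, target_sec):
    """Relative duration mismatch between a source line and a dubbed line.
    (Assumed metric; the abstract does not define one.)"""
    return abs(target_sec - source_sec) / source_sec

def build_preference_pair(source_sec, candidates):
    """Return (chosen, rejected) translations for one segment.

    candidates: list of (translation_text, estimated_duration_sec).
    The best-aligned candidate is 'chosen', the worst-aligned is 'rejected'.
    """
    ranked = sorted(candidates, key=lambda c: duration_gap(source_sec, c[1]))
    return ranked[0][0], ranked[-1][0]

def dpo_style_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single pair, applied here per segment.
    Inputs are sequence log-probabilities under the policy and a frozen
    reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative data only: one 2.0 s source line with three candidates.
chosen, rejected = build_preference_pair(
    2.0, [("candidate A", 2.1), ("candidate B", 3.5), ("candidate C", 1.2)]
)
```

In this sketch the loss falls as the policy assigns relatively more probability than the reference model to the better-aligned translation, which is the general mechanism by which preference optimization could push translations toward the source duration.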
Similar Papers
SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
Audio and Speech Processing
Makes videos speak in any language, perfectly synced.
Length Aware Speech Translation for Video Dubbing
Computation and Language
Makes dubbed movie voices match the talking.
VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
Multimedia
Makes videos talk with matching lip movements.