A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis

Published: September 16, 2025 | arXiv ID: 2509.12831v1

By: Javeria Amir, Farwa Attaria, Mah Jabeen and more

Potential Business Impact:

Makes talking robots sound real, even with noise.

Business Areas:
Speech Recognition Data and Analytics, Software

Recent developments in voice cloning and talking head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods are typically trained on large-scale datasets of clean, studio-recorded inputs and rely on computationally intensive processes, which is infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline with two components. The first is Tortoise text-to-speech, a transformer-based latent diffusion model that can perform high-fidelity zero-shot voice cloning given only a few training samples. The second is a lightweight generative adversarial network architecture for robust real-time lip synchronization. The solution contributes to several essential tasks: reducing reliance on massive pre-training, generating emotionally expressive speech, and performing lip synchronization in noisy and unconstrained scenarios. The modular structure of the pipeline allows easy extension to future multimodal and text-guided voice modulation, and it could be used in real-world systems.
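
The modular structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Audio`, `Video`, `Pipeline`, `stub_cloner`, and `stub_syncer` are hypothetical, and the two stub stages only stand in for the Tortoise voice-cloning model and the GAN-based lip-sync model so that the wiring between stages is visible. The stubs produce silence and recycled frames, not real speech or video.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Audio:
    samples: List[float]
    sample_rate: int


@dataclass
class Video:
    frames: List[str]  # placeholder frame identifiers, not real images
    fps: int


# Hypothetical stage signatures; any real model can be wrapped to match them.
VoiceCloner = Callable[[str, Sequence[Audio]], Audio]
LipSyncer = Callable[[Video, Audio], Video]


@dataclass
class Pipeline:
    """Two independent stages, so either one can be swapped out."""
    clone_voice: VoiceCloner
    sync_lips: LipSyncer

    def run(self, text: str, references: Sequence[Audio],
            face_video: Video) -> Tuple[Audio, Video]:
        speech = self.clone_voice(text, references)
        return speech, self.sync_lips(face_video, speech)


def stub_cloner(text: str, references: Sequence[Audio]) -> Audio:
    # Assumption for illustration: one second of silence per word,
    # at the sample rate of the first reference clip.
    sr = references[0].sample_rate
    n_words = len(text.split())
    return Audio(samples=[0.0] * (sr * n_words), sample_rate=sr)


def stub_syncer(video: Video, speech: Audio) -> Video:
    # Loop the input frames so the output lasts as long as the speech.
    duration = len(speech.samples) / speech.sample_rate
    n_frames = round(duration * video.fps)
    frames = [video.frames[i % len(video.frames)] for i in range(n_frames)]
    return Video(frames=frames, fps=video.fps)


pipeline = Pipeline(clone_voice=stub_cloner, sync_lips=stub_syncer)
refs = [Audio(samples=[0.0] * 16000, sample_rate=16000)]
face = Video(frames=["f0", "f1"], fps=25)
speech, talking_head = pipeline.run("hello noisy world", refs, face)
print(len(speech.samples), len(talking_head.frames))
```

Because each stage is just a callable with a fixed signature, a denoising step or a text-guided modulation step could be added as another callable without touching the existing stages, which is the kind of extension the abstract alludes to.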

Country of Origin
🇰🇷 Korea, Republic of

Page Count
22 pages

Category
Computer Science:
Sound