Latent Flow Matching for Expressive Singing Voice Synthesis
By: Minhyeok Yun, Yong-Hoon Choi
Potential Business Impact:
Makes computer singing sound more human-like.
Conditional variational autoencoder (cVAE)-based singing voice synthesis provides efficient inference and strong audio quality by learning a score-conditioned prior and a recording-conditioned posterior latent space. However, because synthesis relies on prior samples while training uses posterior latents inferred from real recordings, imperfect distribution matching can cause a prior-posterior mismatch that degrades fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in latent space to learn a continuous vector field transporting prior latents toward posterior latents along an optimal-transport-inspired path. At inference time, the learned latent flow refines a prior sample by solving an ordinary differential equation (ODE) before waveform generation, improving expressiveness while preserving the efficiency of parallel decoding. Experiments on Korean and Chinese singing datasets demonstrate consistent improvements over strong baselines, including lower mel-cepstral distortion, lower fundamental-frequency error, and higher perceptual scores on the Korean dataset. Code, pretrained checkpoints, and audio demos are available at https://github.com/alsgur9368/FM-Singer
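The abstract's core recipe — regress a vector field against the straight-line (OT-inspired) velocity between prior and posterior samples, then refine a prior sample by integrating that field as an ODE — can be sketched on toy data. This is a minimal illustration, not the paper's implementation: FM-Singer uses neural networks and score conditioning, whereas here the "prior" and "posterior" are toy Gaussians, the vector field is a linear model, and names like `refine_prior` are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Toy stand-ins for the cVAE's score-conditioned prior and
# recording-conditioned posterior latent distributions (assumed, not
# from the paper): the posterior is the prior shifted by `shift`.
shift = np.array([1.0, -0.5, 0.3, 0.8])

def prior(n):
    return rng.normal(0.0, 1.0, (n, dim))

def posterior(n):
    return rng.normal(0.0, 1.0, (n, dim)) + shift

def features(x, t):
    # Linear vector field v(x, t) = W @ [x, t, 1].
    n = x.shape[0]
    return np.hstack([x, np.full((n, 1), t), np.ones((n, 1))])

# CFM training: sample (x0, x1, t), form the straight OT-inspired path
# x_t = (1 - t) x0 + t x1, and regress the target velocity u = x1 - x0.
W = np.zeros((dim, dim + 2))
lr = 0.05
for step in range(2000):
    x0, x1 = prior(64), posterior(64)
    t = rng.uniform(0.0, 1.0)
    xt = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    phi = features(xt, t)
    grad = (phi @ W.T - u).T @ phi / len(x0)  # MSE gradient w.r.t. W
    W -= lr * grad

def refine_prior(x, n_steps=10):
    """Euler ODE solve from t=0 to t=1: transport prior latents
    toward the posterior before (in the paper) waveform decoding."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * (features(x, i * dt) @ W.T)
    return x

z = refine_prior(prior(512))
print(np.round(z.mean(axis=0), 1))  # mean of refined latents, roughly `shift`
```

The number of Euler steps trades refinement quality against inference cost, which is why latent-space flow matching can keep the parallel decoder fast: the ODE is solved in the low-dimensional latent space, not on the waveform.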
Similar Papers
LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation
CV and Pattern Recognition
Creates better medical scans with built-in confidence.
CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis
CV and Pattern Recognition
Creates fake CT scans from doctor's notes.
DAFMSVC: One-Shot Singing Voice Conversion with Dual Attention Mechanism and Flow Matching
Sound
Changes one person's singing voice to another's.