Real-Time Streaming Mel Vocoding with Generative Flow Matching
By: Simon Welker, Tal Peer, Timo Gerkmann
Potential Business Impact:
Makes computer voices sound more real, faster.
The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudoinverse operator of the Mel filterbank, we develop MelFlow, a streaming-capable generative Mel vocoder for speech sampled at 16 kHz with an algorithmic latency of only 32 ms and a total latency of 48 ms. We show real-time streaming capability at this latency not only in theory, but also in practice on a consumer laptop GPU. Furthermore, we show that our model achieves substantially better PESQ and SI-SDR values than well-established Mel vocoding baselines that are not streaming-capable, including HiFi-GAN.
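To make the role of the Mel filterbank pseudoinverse concrete, the following is a minimal NumPy sketch of how a Mel magnitude spectrogram can be mapped back to an approximate STFT magnitude. It is not the authors' implementation and covers neither the flow matching model nor the phase retrieval step. The filterbank construction, the 80 Mel bands, the 512-point FFT (chosen here because 32 ms at 16 kHz equals 512 samples), and all function names are assumptions made for illustration.

```python
import numpy as np

SAMPLE_RATE = 16000                      # from the abstract: speech sampled at 16 kHz
N_FFT = int(0.032 * SAMPLE_RATE)         # 32 ms -> 512 samples (assumed frame length)


def hz_to_mel(f):
    """HTK-style Hz-to-Mel conversion (one common convention, assumed here)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sr=SAMPLE_RATE, n_fft=N_FFT, n_mels=80):
    """Triangular Mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sr / 2, n_bins)
    # n_mels + 2 edge points, equally spaced on the Mel scale.
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def mel_pinv_magnitude(mel_spec, fb):
    """Approximate STFT magnitudes (n_bins, frames) from Mel magnitudes (n_mels, frames)."""
    fb_pinv = np.linalg.pinv(fb)         # Moore-Penrose pseudoinverse, shape (n_bins, n_mels)
    mag = fb_pinv @ mel_spec
    # The pseudoinverse can yield negative values; magnitudes must be non-negative.
    return np.maximum(mag, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fb = mel_filterbank()
    true_mag = np.abs(rng.standard_normal((N_FFT // 2 + 1, 10)))  # toy magnitudes
    mel = fb @ true_mag                                           # forward Mel projection
    approx_mag = mel_pinv_magnitude(mel, fb)
    print(fb.shape, mel.shape, approx_mag.shape)
```

Because the filterbank maps many frequency bins to fewer Mel bands, the pseudoinverse yields only a coarse magnitude estimate. In a system like the one described, a generative model would have to restore the detail lost in this step and supply the phase.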
Similar Papers
UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
Audio and Speech Processing
Makes quiet sounds loud and clear.
StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
Sound
Makes computers talk like real people instantly.
WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching
Sound
Makes computer voices sound more real and faster.