Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation
By: Yongqi Wang, Chunlei Zhang, Hangting Chen, and more
Potential Business Impact:
Makes voices sound like anyone, with any emotion.
Controllable TTS models driven by natural language prompts often lack fine-grained control and suffer from a scarcity of high-quality data. We propose a two-stage style-controllable TTS system based on language models, using a quantized masked-autoencoded style-rich representation as an intermediary. In the first stage, an autoregressive transformer conditionally generates these style-rich tokens from text and control signals. The second stage generates codec tokens from the text and the sampled style-rich tokens. Experiments show that training the first-stage model on extensive datasets improves both the content robustness of the two-stage model and its control capabilities over multiple attributes. By selectively combining discrete labels and speaker embeddings, we explore fully controlling the speaker's timbre and other stylistic information, as well as adjusting attributes such as emotion for a specified speaker. Audio samples are available at https://style-ar-tts.github.io.
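To make the two-stage token pipeline concrete, below is a minimal sketch, not the authors' implementation: it assumes a simple decoder-only transformer (`ARTokenLM`) and hypothetical vocabulary sizes and token counts standing in for the text tokenizer, the quantized masked-autoencoder style codebook, and the audio codec codebook. Stage 1 samples style-rich tokens from text plus control labels; Stage 2 generates codec tokens from text plus the sampled style tokens.

```python
# Hedged sketch of the two-stage token pipeline; all module names, vocabulary
# sizes, and sequence lengths are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class ARTokenLM(nn.Module):
    """Decoder-only transformer that autoregressively predicts target tokens
    given a conditioning prefix (text / control / style tokens)."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)  # (batch, T, vocab) next-token logits

    @torch.no_grad()
    def generate(self, prefix: torch.Tensor, n_new: int, temperature: float = 1.0):
        tokens = prefix
        for _ in range(n_new):
            logits = self(tokens)[:, -1] / temperature
            nxt = torch.multinomial(logits.softmax(-1), 1)  # sample next token
            tokens = torch.cat([tokens, nxt], dim=1)
        return tokens[:, prefix.size(1):]  # return only newly generated tokens


# Hypothetical codebook sizes for the style-rich tokens and codec tokens.
stage1 = ARTokenLM(vocab_size=1024)   # predicts style-rich (MAE) tokens
stage2 = ARTokenLM(vocab_size=2048)   # predicts codec tokens

text_ids = torch.randint(0, 100, (1, 20))   # placeholder text/phoneme tokens
control_ids = torch.randint(0, 10, (1, 3))  # e.g. pitch / emotion / speaker labels

# Stage 1: sample style-rich tokens conditioned on text + control signals.
style_tokens = stage1.generate(torch.cat([text_ids, control_ids], dim=1), n_new=50)

# Stage 2: generate codec tokens from text + sampled style tokens; a neural
# codec decoder (not shown) would then reconstruct the waveform.
codec_tokens = stage2.generate(torch.cat([text_ids, style_tokens], dim=1), n_new=100)
print(codec_tokens.shape)
```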
Similar Papers
Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation
Sound
Makes one voice talk in many languages.
AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis
Sound
Makes computer voices sound more human and lively.
Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
Sound
Makes computer voices sound more natural and human.