VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
By: Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze
Potential Business Impact:
Speaks words instantly as you type them.
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS systems: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.
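The idea of a monotonic alignment with a dynamic look-ahead that does not delay onset can be illustrated with a toy sketch. This is not the VoXtream implementation: the function name `stream_frames`, the fixed `frames_per_phoneme` duration, and the `max_lookahead` parameter are all hypothetical, and no neural model is involved. The real system predicts duration tokens, whereas this sketch assumes a constant number of frames per phoneme. It only shows the scheduling behaviour: output starts as soon as the first word arrives, the phoneme pointer never moves backwards, and each step uses whatever future phonemes happen to be buffered (up to a cap) instead of waiting for them.

```python
from typing import Iterable, Iterator, List, Tuple


def stream_frames(
    words: Iterable[List[str]],
    frames_per_phoneme: int = 2,   # assumption: fixed duration, not predicted
    max_lookahead: int = 2,        # assumption: cap on future phonemes used
) -> Iterator[Tuple[int, str, int]]:
    """Yield (frame_index, phoneme, lookahead_used) as words arrive.

    Toy scheduler only: a real system would run a model at each step.
    """
    buffer: List[str] = []  # phonemes received so far
    pos = 0                 # monotonic pointer into the buffer
    frame = 0               # running index of emitted audio frames
    for word in words:
        buffer.extend(word)
        # Speak everything that has arrived; never wait for more input.
        while pos < len(buffer):
            # Dynamic look-ahead: use only phonemes already buffered.
            lookahead = min(max_lookahead, len(buffer) - pos - 1)
            for _ in range(frames_per_phoneme):
                yield frame, buffer[pos], lookahead
                frame += 1
            pos += 1  # alignment only ever advances


if __name__ == "__main__":
    incoming = [["HH", "AY"], ["DH", "EH", "R"]]
    for item in stream_frames(incoming, frames_per_phoneme=1):
        print(item)
```

Because `stream_frames` is a generator, the first frame is produced after only the first word has been consumed from `words`, which mirrors the "begins speaking from the first word" behaviour described above. The look-ahead shrinks to zero at the end of the buffered text instead of blocking.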
Similar Papers
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Computation and Language
Lets computers talk and understand like humans.
StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
Sound
Makes computers talk like real people instantly.
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Sound
Makes computers talk with any voice, any style.