LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models
By: Beilong Tang, Bang Zeng, Ming Li
Potential Business Impact:
Extracts a target speaker's voice from noisy, multi-speaker recordings.
We propose LauraTSE, an Auto-Regressive Decoder-Only Language Model for Target Speaker Extraction built upon the LauraGPT backbone. LauraTSE employs a small-scale auto-regressive decoder-only language model that generates the initial layers of the target speech's discrete codec representations from the continuous embeddings of both the mixture and the reference speech. These outputs serve as coarse-grained predictions. To refine them, a one-step encoder-only language model reconstructs the full codec representation by integrating information from both the mixture and the reference speech, adding fine-grained details. Experimental results show that our approach achieves promising performance. Additionally, we conduct ablation studies to investigate data scalability and the contribution of the encoder-only model.
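The abstract's two-stage design (a coarse auto-regressive decoder followed by a one-step encoder-only refiner) can be sketched as below. This is a minimal, hypothetical illustration, not the authors' implementation: the module sizes, codebook size, number of codec layers, and class names are all assumptions.

```python
# Hypothetical sketch of a coarse-then-fine codec-token pipeline like the one
# described in the LauraTSE abstract. All hyperparameters are assumed values.
import torch
import torch.nn as nn

VOCAB, DIM, N_CODEC_LAYERS = 1024, 256, 8  # assumed codec codebook/layer sizes

class CoarseARDecoder(nn.Module):
    """Auto-regressive decoder-only LM: predicts first-layer codec tokens of
    the target speech from continuous mixture + reference embeddings."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, cond, prev_tokens):
        # cond: (B, Tc, DIM) continuous mixture/reference embeddings
        # prev_tokens: (B, T) previously generated first-layer codec tokens
        x = torch.cat([cond, self.embed(prev_tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.decoder(x, mask=mask)           # causal attention
        return self.head(h[:, cond.size(1):])    # logits over next tokens

class FineEncoder(nn.Module):
    """One-step encoder-only model: reconstructs all codec layers from the
    coarse tokens plus mixture/reference information (fine-grained details)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.Linear(DIM, VOCAB * N_CODEC_LAYERS)

    def forward(self, coarse_tokens, cond):
        h = self.encoder(self.embed(coarse_tokens) + cond)  # non-causal, one step
        B, T, _ = h.shape
        return self.heads(h).view(B, T, N_CODEC_LAYERS, VOCAB)

# Shape check with dummy data
cond = torch.randn(1, 10, DIM)
prev = torch.randint(0, VOCAB, (1, 10))
coarse_logits = CoarseARDecoder()(cond, prev)  # (1, 10, VOCAB)
fine_logits = FineEncoder()(prev, cond)        # (1, 10, N_CODEC_LAYERS, VOCAB)
```

The key contrast the abstract draws is visible here: the first stage decodes token-by-token under a causal mask, while the second stage fills in the remaining codec layers in a single non-autoregressive pass.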
Similar Papers
Online Audio-Visual Autoregressive Speaker Extraction
Audio and Speech Processing
Helps computers hear one voice in noisy rooms.
UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement
Sound
Cleans up noisy audio for many tasks.
Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
Sound
Makes computer voices sound more natural and human.