High-Fidelity Speech Enhancement via Discrete Audio Tokens
By: Luca A. Lanzendörfer, Frédéric Berdoz, Antonis Asonitis and more
Potential Business Impact:
Cleans up noisy speech for better hearing.
Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low-sampling-rate codecs, limiting them to narrow, task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods both on objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.
Similar Papers
UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement
Sound
Cleans up noisy audio for many tasks.
Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement
Audio and Speech Processing
Cleans up noisy sounds for small gadgets.
LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
Audio and Speech Processing
Makes voices clearer while keeping their original sound.