SincQDR-VAD: A Noise-Robust Voice Activity Detection Framework Leveraging Learnable Filters and Ranking-Aware Optimization
By: Chien-Chun Wang, En-Lun Yu, Jeih-Weih Hung and more
Potential Business Impact:
Helps voice assistants hear you better in noise.
Voice activity detection (VAD) is essential for speech-driven applications, but remains far from perfect in noisy and resource-limited environments. Existing methods often lack robustness to noise, and their frame-wise classification losses are only loosely coupled with the evaluation metric of VAD. To address these challenges, we propose SincQDR-VAD, a compact and robust framework that combines a Sinc-extractor front-end with a novel quadratic disparity ranking loss. The Sinc-extractor uses learnable bandpass filters to capture noise-resistant spectral features, while the ranking loss optimizes the pairwise score order between speech and non-speech frames to improve the area under the receiver operating characteristic curve (AUROC). A series of experiments conducted on representative benchmark datasets show that our framework considerably improves both AUROC and F2-Score while using only 69% of the parameters of prior art, confirming its efficiency and practical viability.
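The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the filter parameterization or the exact form of the quadratic disparity ranking loss, so the sketch assumes a windowed-sinc bandpass kernel defined by two cutoff frequencies (which would be trainable parameters in a learnable front-end) and a generic squared-hinge loss over speech/non-speech score pairs. The names `sinc_bandpass` and `pairwise_quadratic_ranking_loss`, the `margin` parameter, and all default values are hypothetical.

```python
import numpy as np


def sinc_bandpass(f_low, f_high, kernel_size=257, sample_rate=16000):
    """Windowed-sinc bandpass FIR kernel between two cutoffs in Hz.

    Assumption: in a learnable front-end, f_low and f_high would be
    trainable parameters; here they are plain arguments.
    """
    # Sample times in seconds, centered on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    t = n / sample_rate
    # An ideal low-pass with cutoff fc is 2*fc*sinc(2*fc*t), where
    # np.sinc(x) = sin(pi*x) / (pi*x). A bandpass is the difference
    # of two such low-pass filters.
    low = 2 * f_low * np.sinc(2 * f_low * t)
    high = 2 * f_high * np.sinc(2 * f_high * t)
    # The Hamming window reduces ripple from truncating the sinc;
    # dividing by sample_rate gives roughly unit passband gain.
    return (high - low) * np.hamming(kernel_size) / sample_rate


def pairwise_quadratic_ranking_loss(scores, labels, margin=1.0):
    """Squared-hinge penalty over all (speech, non-speech) frame pairs.

    Assumption: a generic pairwise AUROC surrogate, not the paper's
    exact quadratic disparity ranking loss.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]  # speech frame scores
    neg = scores[labels == 0]  # non-speech frame scores
    if pos.size == 0 or neg.size == 0:
        return 0.0  # no pairs to rank
    # diff[i, j] = score of speech frame i minus score of non-speech frame j.
    diff = pos[:, None] - neg[None, :]
    # Penalize pairs where speech does not outscore non-speech by the margin.
    return float(np.mean(np.maximum(0.0, margin - diff) ** 2))
```

Because the loss depends only on score differences between speech and non-speech frames, lowering it pushes speech scores above non-speech scores, which is the ordering AUROC measures. A frame-wise cross-entropy loss has no such direct link to the ranking.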
Similar Papers
QvTAD: Differential Relative Attribute Learning for Voice Timbre Attribute Detection
Sound
Makes computer voices sound more like real people.
Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining
Audio and Speech Processing
Helps computers hear specific voices in noise.
Enhancing Automatic Speech Recognition Through Integrated Noise Detection Architecture
Sound
Helps computers hear words better in noisy places.