ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring
By: Ari Frummer, Helin Wang, Tianyu Cao and more
Potential Business Impact:
Rates the quality and intelligibility of separated speech without needing the original clean audio.
Source separation is a crucial pre-processing step for many speech processing tasks, such as automatic speech recognition (ASR). Traditionally, evaluation metrics for speech separation rely on matched reference audio and corresponding transcriptions to assess audio quality and intelligibility. However, such metrics cannot be used to evaluate real-world mixtures, for which no reference exists. This paper introduces a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework uses the mixture and the separated tracks to jointly predict audio quality, through the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) metric, and speech intelligibility, through the Word Error Rate (WER) metric. We conducted experiments on the WHAMR! dataset, which show WER estimation with a mean absolute error (MAE) of 17% and a Pearson correlation coefficient (PCC) of 0.77, and SI-SNR estimation with an MAE of 1.38 and a PCC of 0.95. We further demonstrate the robustness of our estimator by using various SSL representations.
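For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of the conventional reference-based definitions of SI-SNR and WER, together with the MAE and PCC used to score an estimator against them. It is not the authors' implementation and says nothing about the SSL-based predictor itself; the function names (si_snr, wer, mae, pcc) and the eps stabiliser are illustrative assumptions, and the WER here uses plain whitespace tokenisation with no text normalisation.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between a separated track and its clean reference."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def wer(reference_text, hypothesis_text):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref = reference_text.split()
    hyp = hypothesis_text.split()
    # Levenshtein distance computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def mae(predicted, actual):
    """Mean absolute error between predicted and reference-based scores."""
    predicted = np.asarray(predicted, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    return float(np.mean(np.abs(predicted - actual)))


def pcc(predicted, actual):
    """Pearson correlation coefficient between predicted and reference-based scores."""
    return float(np.corrcoef(predicted, actual)[0, 1])
```

In a reference-free setting such as the one described above, scores like these would presumably serve only as training and evaluation targets: the estimator sees the mixture and the separated tracks, and its predictions are compared with the reference-based values through MAE and PCC.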
Similar Papers
A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References
Audio and Speech Processing
Cleans up noisy speech for clearer listening.
Layer-wise Analysis for Quality of Multilingual Synthesized Speech
Audio and Speech Processing
Makes computer voices sound more human-like.
MAPSS: Manifold-based Assessment of Perceptual Source Separation
Audio and Speech Processing
Makes music and voice separation sound better.