SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction
By: Saurabh Agrawal, Raj Gohil, Gopal Kumar Agrawal and more
Potential Business Impact:
Helps pick the best computer-generated voices.
Speech quality assessment is a critical step in selecting text-to-speech (TTS) synthesis or voice conversion models. Synthesized speech can be evaluated with objective or subjective metrics. Although there are many objective metrics, such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), and Short-Time Objective Intelligibility (STOI), none of them is practical for selecting the best model. Subjective metrics such as the Mean Opinion Score (MOS), on the other hand, are highly reliable but time-consuming and require considerable manual effort. To address these issues in MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), a small, end-to-end, highly generalized, and scalable model for predicting MOS on a 5-point scale. We stack sequences of convolutions to extract latent features from the audio samples and achieve state-of-the-art results in terms of mean squared error (MSE), Linear Concordance Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall Rank Correlation Coefficient (KTAU).
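The four evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes that LCC is computed as the Pearson linear correlation, as is common in MOS prediction work, and that KTAU is the tau-b variant of Kendall's coefficient. The function names and the sample scores are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def mse(pred, true):
    """Mean squared error between predicted and reference MOS values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pearson(x, y):
    """Pearson linear correlation (used here as an assumed stand-in for LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau(x, y):
    """Kendall tau-b, computed over all pairs (quadratic time)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: excluded from tau-b
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


# Hypothetical listener MOS values and model predictions (made-up data).
true_mos = [4.5, 3.0, 2.0, 3.8, 1.5]
pred_mos = [4.2, 3.1, 2.4, 3.5, 1.9]

print("MSE :", mse(pred_mos, true_mos))
print("LCC :", pearson(pred_mos, true_mos))
print("SRCC:", spearman(pred_mos, true_mos))
print("KTAU:", kendall_tau(pred_mos, true_mos))
```

MSE measures how far the predictions are from the human scores in absolute terms, while the three correlation coefficients measure agreement in trend and ordering. A predictor can therefore rank systems correctly (high SRCC and KTAU) and still show a nonzero MSE, which is why such measures are usually reported together.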
Similar Papers
APG-MOS: Auditory Perception Guided-MOS Predictor for Synthetic Speech
Sound
Makes computers judge voice quality like people.
Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech
Audio and Speech Processing
Makes computers judge sound quality better.
Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric
Audio and Speech Processing
Makes computers judge talking quality like humans.