Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech
By: Xinyu Liang, Fredrik Cumlin, Victor Ungureanu and more
Potential Business Impact:
Makes computers judge sound quality better.
Self-supervised learning (SSL) models such as Wav2Vec2, HuBERT, and WavLM are widely used in speech processing. These transformer-based models consist of multiple layers, each capturing a different level of representation. While prior studies have explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean opinion score (MOS). Features from each layer are fed into a lightweight regression network to assess their effectiveness. Our experiments consistently show that early-layer features outperform or match those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, offering improved performance and reduced system complexity.
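The pipeline the abstract describes can be sketched as follows: take frame-level features from one chosen SSL layer and feed them into a lightweight regression head that predicts a single MOS value per utterance. This is a minimal illustration, not the authors' exact architecture: the feature dimension (768, as in Wav2Vec2-Base), the mean-pooling step, and the MLP sizes are all assumptions, and the random tensor stands in for hidden states that would in practice come from an SSL model (e.g. via `output_hidden_states=True` in Hugging Face Transformers).

```python
import torch
import torch.nn as nn

class MOSHead(nn.Module):
    """Lightweight regression head: mean-pool frame features, then a small MLP.
    Architecture details here are illustrative assumptions, not from the paper."""
    def __init__(self, feat_dim=768, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) from one chosen SSL layer
        pooled = feats.mean(dim=1)           # temporal mean pooling
        return self.net(pooled).squeeze(-1)  # (batch,) predicted MOS

# Stand-in for features taken from layer k of an SSL model; in practice:
# hidden = model(wav, output_hidden_states=True).hidden_states[k]
feats = torch.randn(4, 200, 768)  # 4 utterances, 200 frames each
head = MOSHead()
mos_pred = head(feats)
print(mos_pred.shape)
```

Because only the layer index changes, the same head can be retrained on each layer's features to compare layers, which is the kind of systematic layer-wise evaluation the paper performs.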
Similar Papers
Layer-wise Analysis for Quality of Multilingual Synthesized Speech
Audio and Speech Processing
Makes computer voices sound more human-like.
Layer-Wise Analysis of Self-Supervised Representations for Age and Gender Classification in Children's Speech
Audio and Speech Processing
Helps computers tell kids' ages and genders.
Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech?
Audio and Speech Processing
Makes computers understand kids' talking better.