Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings
By: Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, and more
Potential Business Impact:
Rates how good computer-made sounds are.
We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores (Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness) for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A central challenge is the domain shift between natural training data and synthetic evaluation data. To address it, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor, and apply a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.
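A minimal PyTorch sketch of the training setup the abstract describes, under stated assumptions: BEATs frame embeddings are precomputed (768-dimensional here), "multi-branch" means one LSTM branch and regression head per aesthetic axis, and "buffer-based sampling" means keeping recent (embedding, score) pairs and drawing positives/negatives by score proximity. The class names (MultiBranchLSTM, TripletBuffer), thresholds, and dimensions are illustrative, not the authors' implementation.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class MultiBranchLSTM(nn.Module):
    """One LSTM branch and regression head per aesthetic axis, plus a shared
    projection yielding a clip-level embedding for the triplet term."""

    AXES = ["production_quality", "production_complexity",
            "content_enjoyment", "content_usefulness"]

    def __init__(self, embed_dim=768, hidden=128):
        super().__init__()
        self.proj = nn.Linear(embed_dim, hidden)  # clip-level embedding
        self.branches = nn.ModuleDict(
            {a: nn.LSTM(embed_dim, hidden, batch_first=True) for a in self.AXES})
        self.heads = nn.ModuleDict(
            {a: nn.Linear(hidden, 1) for a in self.AXES})

    def forward(self, frames):                    # frames: (B, T, embed_dim)
        clip = self.proj(frames.mean(dim=1))      # (B, hidden)
        scores = {}
        for a in self.AXES:
            out, _ = self.branches[a](frames)
            scores[a] = self.heads[a](out.mean(dim=1)).squeeze(-1)  # (B,)
        return scores, clip


class TripletBuffer:
    """Assumed form of buffer-based sampling: store recent (embedding, score)
    pairs; for an anchor, a positive has a similar score, a negative a
    dissimilar one, so the embedding space is shaped by perceptual similarity."""

    def __init__(self, capacity=512, pos_gap=0.5, neg_gap=1.5):
        self.buf = deque(maxlen=capacity)
        self.pos_gap, self.neg_gap = pos_gap, neg_gap

    def add(self, emb, score):
        self.buf.append((emb.detach(), float(score)))

    def sample(self, anchor_score):
        pos = [e for e, s in self.buf if abs(s - anchor_score) <= self.pos_gap]
        neg = [e for e, s in self.buf if abs(s - anchor_score) >= self.neg_gap]
        if pos and neg:
            return random.choice(pos), random.choice(neg)
        return None                               # buffer not yet diverse enough


model = MultiBranchLSTM()
buffer = TripletBuffer()
mse, triplet = nn.MSELoss(), nn.TripletMarginLoss(margin=1.0)

# Stand-ins for precomputed BEATs frame embeddings and one axis's ratings.
frames = torch.randn(4, 250, 768)
targets = torch.rand(4) * 4 + 1                   # hypothetical 1-5 ratings

scores, clip = model(frames)
loss = mse(scores["production_quality"], targets) # real training covers all axes
for emb, y in zip(clip, targets):
    pair = buffer.sample(float(y))
    if pair is not None:
        p, n = pair
        loss = loss + triplet(emb.unsqueeze(0), p.unsqueeze(0), n.unsqueeze(0))
    buffer.add(emb, y)
loss.backward()
```

In this sketch the buffer is refilled across mini-batches, so early steps contribute only the regression loss until enough score diversity accumulates; the pos_gap/neg_gap thresholds control how strict "perceptually similar" is.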
Similar Papers
The AudioMOS Challenge 2025
Sound
Makes computers judge fake sounds as good or bad.
AudioEval: Automatic Dual-Perspective and Multi-Dimensional Evaluation of Text-to-Audio Generation
Sound
Helps computers judge how good generated audio sounds.
Multidimensional Music Aesthetic Evaluation via Semantically Consistent C-Mixup Augmentation
Sound
Teaches computers to rate how good music sounds.