ASTAR-NTU solution to AudioMOS Challenge 2025 Track 1
By: Fabian Ritter-Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, and more
Potential Business Impact:
Makes computers judge music quality automatically.
Evaluation of text-to-music systems is constrained by the cost and limited availability of expert listeners. Track 1 of the AudioMOS 2025 Challenge was created to automatically predict music impression (MI) as well as text alignment (TA) between the prompt and the generated musical piece. This paper reports our winning system, which uses a dual-branch architecture with pre-trained MuQ and RoBERTa models as audio and text encoders. A cross-attention mechanism fuses the audio and text representations. For training, we reframe MI and TA prediction as a classification task. To incorporate the ordinal nature of MOS scores, one-hot labels are converted to a soft distribution using a Gaussian kernel. On the official test set, a single model trained with this method achieves a system-level Spearman's rank correlation coefficient (SRCC) of 0.991 for MI and 0.952 for TA, a relative improvement of 21.21% in MI SRCC and 31.47% in TA SRCC over the challenge baseline.
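The Gaussian soft-label idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of classes (a 1-to-5 MOS grid) and the kernel width `sigma` are assumptions chosen for clarity.

```python
import numpy as np

def soft_mos_labels(score: float, num_classes: int = 5, sigma: float = 0.5) -> np.ndarray:
    """Convert a MOS rating into a soft distribution over ordinal classes.

    A Gaussian kernel centered on the true score spreads probability mass
    to neighboring classes, encoding the ordinal nature of MOS ratings
    instead of a hard one-hot target. The 1..5 grid and sigma=0.5 are
    illustrative assumptions, not values reported in the paper.
    """
    classes = np.arange(1, num_classes + 1, dtype=float)       # MOS grid 1..5
    weights = np.exp(-0.5 * ((classes - score) / sigma) ** 2)  # Gaussian kernel
    return weights / weights.sum()                             # normalize to a distribution

# Example: a rating of 4 puts most mass on class 4, with symmetric
# leakage into the neighboring classes 3 and 5.
dist = soft_mos_labels(4.0)
```

A distribution like `dist` would then serve as the target for a cross-entropy (or KL-divergence) loss in place of a one-hot label, which is what lets the classifier respect score ordering.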