Score: 1

MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation

Published: January 11, 2026 | arXiv ID: 2601.06829v1

By: Bochao Sun, Yang Xiao, Han Yin

Potential Business Impact:

Automatically scores how well computer-generated audio matches its text description.

Business Areas:
Semantic Search, Internet Services

Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods rely primarily on subjective human listening tests, which are time-consuming. To address this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model ranked first in the XACLE Challenge, achieving an SRCC of 0.6402 on the test dataset, a 30.6% improvement over the challenge baseline. Code is available at: https://github.com/S-Orion/MOESCORE.
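
The result above is reported as SRCC, the Spearman rank correlation coefficient, which measures how closely the ordering of predicted relevance scores follows the ordering of human ratings. The sketch below shows one way such a value could be computed using only the Python standard library. It is an illustration, not the challenge's official scoring script: the function names `average_ranks` and `srcc` and the toy scores are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the window over every item tied with the one at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def srcc(predicted, human):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(human)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy data (hypothetical): model scores vs. human ratings for five clips.
model_scores = [7.1, 3.2, 8.4, 5.0, 1.9]
human_ratings = [6, 4, 9, 3, 2]
print(round(srcc(model_scores, human_ratings), 4))
```

Because SRCC depends only on rank order, an evaluator is rewarded for ordering text-audio pairs the way human listeners do, even if its raw scores sit on a different scale from the human ratings.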

Country of Origin
🇰🇷 Korea, Republic of

Page Count
3 pages

Category
Computer Science: Sound