🤖 AI Summary
To address the high cost and scarcity of expert listeners in the subjective evaluation of text-to-music generation systems, this work proposes a dual-branch cross-attention model that jointly models audio and textual semantics to automatically predict Music Impression (MI) and Text Alignment (TA) quality. Ordinal Mean Opinion Score (MOS) ratings are reformulated as soft label distributions via Gaussian kernel smoothing, and pre-trained MuQ (audio) and RoBERTa (text) encoders are coupled through cross-modal cross-attention for fine-grained alignment. On the AudioMOS 2025 Challenge Track 1 benchmark, a single model achieves an MI SRCC of 0.991 and a TA SRCC of 0.952, outperforming the baseline by a relative 21.2% and 31.5%, respectively. This significantly advances the accuracy and practicality of automated quality assessment for music generation.
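The Gaussian-kernel soft-label idea can be illustrated with a minimal sketch: a scalar MOS rating is spread over a grid of ordinal score bins, so that bins near the rating receive most of the probability mass. The bin grid and kernel width below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def gaussian_soft_labels(mos, bins, sigma=0.5):
    """Convert a scalar MOS rating into a soft distribution over score bins.

    A Gaussian kernel centered on the rating assigns higher probability to
    nearby bins, encoding the ordinal structure that a one-hot label ignores.
    `bins` and `sigma` are hypothetical choices for illustration.
    """
    logits = -((bins - mos) ** 2) / (2.0 * sigma ** 2)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    return probs / probs.sum()

# Hypothetical 9-bin grid over the MOS range [1, 5] in steps of 0.5.
bins = np.arange(1.0, 5.01, 0.5)
dist = gaussian_soft_labels(3.5, bins)
```

Training then minimizes cross-entropy between the model's predicted distribution and this soft target, rather than against a hard one-hot label.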
📝 Abstract
Evaluation of text-to-music systems is constrained by the cost and scarcity of expert listeners. Track 1 of the AudioMOS 2025 Challenge was created to automatically predict music impression (MI) as well as text alignment (TA) between the prompt and the generated musical piece. This paper reports our winning system, which uses a dual-branch architecture with pre-trained MuQ and RoBERTa models as the audio and text encoders; a cross-attention mechanism fuses the audio and text representations. For training, we reframe MI and TA prediction as a classification task, and to incorporate the ordinal nature of MOS scores, one-hot labels are converted to a soft distribution using a Gaussian kernel. On the official test set, a single model trained with this method achieves a system-level Spearman's rank correlation coefficient (SRCC) of 0.991 for MI and 0.952 for TA, corresponding to a relative improvement of 21.21% in MI SRCC and 31.47% in TA SRCC over the challenge baseline.
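The cross-attention fusion described above can be sketched as a single-head attention step in which audio frames act as queries over the text tokens. This is a minimal numpy illustration of the fusion idea, not the paper's exact layer; the feature dimension and the use of a single head are assumptions.

```python
import numpy as np

def cross_attention(audio, text):
    """Single-head cross-attention: audio frames attend to text tokens.

    audio: (Ta, d) query features, e.g. from an audio encoder such as MuQ
    text:  (Tt, d) key/value features, e.g. from a text encoder such as RoBERTa
    Returns the text context gathered per audio frame and the attention map.
    """
    d = audio.shape[-1]
    scores = audio @ text.T / np.sqrt(d)                 # (Ta, Tt) similarities
    scores -= scores.max(axis=-1, keepdims=True)          # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    context = weights @ text                              # (Ta, d) fused features
    return context, weights

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((4, 8))   # 4 audio frames, dim 8 (hypothetical)
text_feats = rng.standard_normal((6, 8))    # 6 text tokens, dim 8 (hypothetical)
context, weights = cross_attention(audio_feats, text_feats)
```

In the full model, such fused representations would feed a classification head over the soft-label score bins; learned query/key/value projections, omitted here for brevity, would normally precede the attention step.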