🤖 AI Summary
To address the high cost and scarcity of expert listeners in the subjective evaluation of text-to-music generation systems, this work proposes a dual-branch cross-attention model that jointly models audio and textual semantics to automatically predict Music Impression (MI) and Text Alignment (TA) quality. Ordinal Mean Opinion Score (MOS) ratings are reformulated as soft label distributions via Gaussian kernel smoothing, and pre-trained MuQ (audio) and RoBERTa (text) encoders are coupled through cross-modal cross-attention for fine-grained alignment. On the AudioMOS 2025 Challenge Track 1 benchmark, a single model achieves an MI SRCC of 0.991 and a TA SRCC of 0.952, outperforming the baseline by a relative 21.2% and 31.5%, respectively. This significantly advances the accuracy and practicality of automated quality assessment for music generation.
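The Gaussian-kernel soft-label idea can be illustrated with a minimal sketch: a scalar MOS rating is spread over a grid of ordinal score bins, so that bins near the rating receive most of the probability mass. The bin grid and kernel width below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def gaussian_soft_labels(mos, bins, sigma=0.5):
    """Convert a scalar MOS rating into a soft distribution over score bins.

    A Gaussian kernel centered on the rating assigns higher probability to
    nearby bins, encoding the ordinal structure that a one-hot label ignores.
    `bins` and `sigma` are hypothetical choices for illustration.
    """
    logits = -((bins - mos) ** 2) / (2.0 * sigma ** 2)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    return probs / probs.sum()

# Hypothetical 9-bin grid over the MOS range [1, 5] in steps of 0.5.
bins = np.arange(1.0, 5.01, 0.5)
dist = gaussian_soft_labels(3.5, bins)
```

Training then minimizes cross-entropy between the model's predicted distribution and this soft target, rather than against a hard one-hot label.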
📝 Abstract
Evaluation of text-to-music systems is constrained by the cost and scarcity of expert listeners. Track 1 of the AudioMOS 2025 Challenge was created to automatically predict music impression (MI) as well as text alignment (TA) between the prompt and the generated musical piece. This paper reports our winning system, which uses a dual-branch architecture with pre-trained MuQ and RoBERTa models as the audio and text encoders; a cross-attention mechanism fuses the audio and text representations. For training, we reframe MI and TA prediction as a classification task, and to incorporate the ordinal nature of MOS scores, one-hot labels are converted to a soft distribution using a Gaussian kernel. On the official test set, a single model trained with this method achieves a system-level Spearman's rank correlation coefficient (SRCC) of 0.991 for MI and 0.952 for TA, corresponding to a relative improvement of 21.21% in MI SRCC and 31.47% in TA SRCC over the challenge baseline.
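The cross-attention fusion described above can be sketched as a single-head attention step in which audio frames act as queries over the text tokens. This is a minimal numpy illustration of the fusion idea, not the paper's exact layer; the feature dimension and the use of a single head are assumptions.

```python
import numpy as np

def cross_attention(audio, text):
    """Single-head cross-attention: audio frames attend to text tokens.

    audio: (Ta, d) query features, e.g. from an audio encoder such as MuQ
    text:  (Tt, d) key/value features, e.g. from a text encoder such as RoBERTa
    Returns the text context gathered per audio frame and the attention map.
    """
    d = audio.shape[-1]
    scores = audio @ text.T / np.sqrt(d)                 # (Ta, Tt) similarities
    scores -= scores.max(axis=-1, keepdims=True)          # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    context = weights @ text                              # (Ta, d) fused features
    return context, weights

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((4, 8))   # 4 audio frames, dim 8 (hypothetical)
text_feats = rng.standard_normal((6, 8))    # 6 text tokens, dim 8 (hypothetical)
context, weights = cross_attention(audio_feats, text_feats)
```

In the full model, such fused representations would feed a classification head over the soft-label score bins; learned query/key/value projections, omitted here for brevity, would normally precede the attention step.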