DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

📅 2026-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic evaluators for text-to-music (TTM) generation are usually trained with point-wise objectives, which do not directly optimize rank-based metrics and only weakly constrain cross-modal consistency. This work proposes DeRA-MOS, a framework that decouples the two goals. A batch-aware listwise ranking loss targets the ordering of music impression (MI) scores within each mini-batch, and a score-anchored modality alignment loss ties audio-text similarity to human text alignment (TA) scores. By addressing the point-wise training mismatch and modality drift, the approach improves rank-based metrics, including Spearman's rank correlation coefficient (SRCC), for both MI and TA on the MusicEval benchmark.
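To make the gap between point-wise error and rank correlation concrete, the following is a minimal sketch in plain Python. The helper names (`average_ranks`, `srcc`, `mse`) and the toy scores are illustrative assumptions, not code or data from the paper. It shows that a predictor with lower mean squared error can still rank items worse than one with a large constant offset.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(pred, target):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rp, rt = average_ranks(pred), average_ranks(target)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / math.sqrt(var_p * var_t)


def mse(pred, target):
    """Mean squared error, the usual point-wise regression objective."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Toy human scores (hypothetical values, not from MusicEval).
mos = [1.0, 2.0, 3.0, 4.0]
pred_a = [2.0, 1.9, 3.0, 4.0]  # small error, but the first two items are swapped
pred_b = [3.0, 4.0, 5.0, 6.0]  # large constant offset, order fully preserved

print(mse(pred_a, mos), srcc(pred_a, mos))
print(mse(pred_b, mos), srcc(pred_b, mos))
```

In this toy case `pred_a` has the lower squared error while `pred_b` has the higher rank correlation, which is the kind of mismatch a ranking-oriented loss is meant to address.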
📝 Abstract
Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. Experiments on MusicEval demonstrate that, by mitigating the point-wise training mismatch and modality drift, our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
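The abstract does not give the exact form of either loss, so the following is only a rough sketch of what the two decoupled objectives could look like. It rests on stated assumptions: the listwise term is a ListNet-style top-1 cross-entropy over one mini-batch, the alignment term is a squared error between cosine similarity and a target obtained by linearly mapping a 1 to 5 MOS into [0, 1], and all function names and parameters are hypothetical.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def listwise_loss(pred_scores, mos, temperature=1.0):
    """ListNet-style cross-entropy between batch-level score distributions.

    Assumption: stands in for the paper's batch-aware listwise ranking loss.
    """
    target = softmax(mos, temperature)
    pred = softmax(pred_scores, temperature)
    return -sum(t * math.log(p) for t, p in zip(target, pred))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(audio_embs, text_embs, ta_mos, lo=1.0, hi=5.0):
    """Squared error between audio-text cosine similarity and a MOS-derived target.

    Assumption: a linear map from the MOS range [lo, hi] to [0, 1] serves as
    the 'score anchor'; the paper's actual mapping may differ.
    """
    total = 0.0
    for a, t, s in zip(audio_embs, text_embs, ta_mos):
        target_sim = (s - lo) / (hi - lo)
        total += (cosine(a, t) - target_sim) ** 2
    return total / len(ta_mos)


# Toy batch: correctly ordered predictions incur a lower listwise loss.
mi_mos = [1.0, 2.0, 3.0, 4.0]
print(listwise_loss([1.0, 2.0, 3.0, 4.0], mi_mos))
print(listwise_loss([4.0, 3.0, 2.0, 1.0], mi_mos))

# Toy embeddings: an identical pair with top TA score, an orthogonal pair with lowest.
print(alignment_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [5.0, 1.0]))
```

Keeping the two terms as separate functions mirrors the decoupling described above: the ranking term only sees MI scores within a batch, while the alignment term only sees embedding pairs and TA scores.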
Problem

Research questions and friction points this paper is trying to address.

text-to-music
evaluation
mean opinion score
modality alignment
ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

listwise ranking
modality alignment
text-to-music evaluation
decoupled optimization
Spearman rank correlation