🤖 AI Summary
Current text-to-music (TTM) generation lacks automated evaluation methods that jointly ensure accuracy and efficiency—primarily due to the high subjectivity of musical quality assessment, challenges in cross-modal alignment between text and audio, and scarcity of high-quality human-annotated data. To address this, we introduce the first expert-annotated TTM evaluation dataset, comprising 2,748 music clips generated by 31 models and 13,740 fine-grained professional critiques. We further establish the first dedicated TTM benchmark and propose a learnable evaluation paradigm grounded in CLAP (Contrastive Language–Audio Pretraining), integrating music semantic annotation, cross-modal alignment, and regression-based scoring. Our learned evaluator achieves strong agreement with human judgments (Spearman ρ > 0.82), substantially outperforming conventional objective metrics (e.g., FAD, KL divergence). The framework provides a reliable, low-cost, and fully reproducible automated evaluation standard for TTM research and development.
📝 Abstract
The technology for generating music from textual descriptions has advanced rapidly. However, evaluating text-to-music (TTM) systems remains a significant challenge, primarily because existing objective and subjective evaluation methods struggle to balance performance and cost. In this paper, we propose an automatic assessment task for TTM models aimed at aligning evaluation with human perception. To address the challenges posed by the professional expertise that music evaluation demands and by the complex relationship between text and music, we collect MusicEval, the first generative music assessment dataset. It contains 2,748 music clips generated by 31 advanced and widely used models in response to 384 text prompts, along with 13,740 ratings from 14 music experts. We further design a CLAP-based assessment model trained on this dataset; our experimental results validate the feasibility of the proposed task and provide a valuable reference for future work on TTM evaluation. The dataset is available at https://www.aishelltech.com/AISHELL_7A.
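The evaluation pipeline described above (embed text and audio, regress a quality score, validate against human ratings via Spearman ρ) can be sketched in miniature as follows. This is an illustrative toy only: the embedding dimension, the linear regression head, and the synthetic data are assumptions for demonstration, not the paper's actual CLAP-based model or the MusicEval data.

```python
import random


def spearman_rho(x, y):
    """Spearman rank correlation between two equal-length score lists.
    (No tie handling, for brevity; real ratings would need average ranks.)"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def score_clip(text_emb, audio_emb, weights, bias=0.0):
    """Toy regression head: score = w · [text_emb ; audio_emb] + b.
    A real evaluator would feed trained CLAP embeddings to a learned MLP."""
    feats = text_emb + audio_emb  # concatenate the two modalities
    return sum(w * f for w, f in zip(weights, feats)) + bias


random.seed(0)
DIM = 4  # tiny stand-in; CLAP embeddings are much larger
weights = [random.uniform(-1, 1) for _ in range(2 * DIM)]
clips = [([random.gauss(0, 1) for _ in range(DIM)],
          [random.gauss(0, 1) for _ in range(DIM)]) for _ in range(10)]
preds = [score_clip(t, a, weights) for t, a in clips]
# Synthetic "human" scores that correlate noisily with the predictions:
human = [p + random.gauss(0, 0.3) for p in preds]
print(round(spearman_rho(preds, human), 3))
```

The key design point mirrored here is that agreement with human judgment is measured by rank correlation rather than absolute error, so the evaluator only needs to order generations consistently with experts, not reproduce their exact scores.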