MusicEval: A Generative Music Corpus with Expert Ratings for Automatic Text-to-Music Evaluation

📅 2025-01-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-music (TTM) generation lacks automated evaluation methods that jointly ensure accuracy and efficiency—primarily due to the high subjectivity of musical quality assessment, challenges in cross-modal alignment between text and audio, and scarcity of high-quality human-annotated data. To address this, we introduce the first expert-annotated TTM evaluation dataset, comprising 2,748 music clips generated by 31 models and 13,740 fine-grained professional critiques. We further establish the first dedicated TTM benchmark and propose a learnable evaluation paradigm grounded in CLAP (Contrastive Language–Audio Pretraining), integrating music semantic annotation, cross-modal alignment, and regression-based scoring. Our learned evaluator achieves strong agreement with human judgments (Spearman ρ > 0.82), substantially outperforming conventional objective metrics (e.g., FAD, KL divergence). The framework provides a reliable, low-cost, and fully reproducible automated evaluation standard for TTM research and development.
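The regression-based scoring idea can be sketched in a toy form: concatenate text and audio embeddings together with their cosine similarity (a stand-in for the cross-modal alignment signal), fit a regression head to human ratings, and validate the learned scores against those ratings with Spearman correlation. The sketch below uses random NumPy arrays in place of real CLAP features and synthetic "human" scores; all shapes, names, and the ridge-regression head are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_clips, dim = 200, 64
text_emb = rng.normal(size=(n_clips, dim))   # stand-in for CLAP text embeddings
audio_emb = rng.normal(size=(n_clips, dim))  # stand-in for CLAP audio embeddings

# Feature vector: both embeddings plus their cosine similarity,
# mirroring cross-modal alignment as an input signal.
cos = np.sum(text_emb * audio_emb, axis=1, keepdims=True) / (
    np.linalg.norm(text_emb, axis=1, keepdims=True)
    * np.linalg.norm(audio_emb, axis=1, keepdims=True)
)
X = np.hstack([text_emb, audio_emb, cos])

# Synthetic "human" mean opinion scores, correlated with the features.
w_true = rng.normal(size=X.shape[1])
human_mos = X @ w_true + 0.1 * rng.normal(size=n_clips)

# Ridge-regression scoring head, solved in closed form.
lam = 1.0
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ human_mos)
pred = X @ w_hat

def spearman_rho(a, b):
    """Spearman correlation (Pearson on ranks; assumes no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

rho = spearman_rho(pred, human_mos)
print(f"Spearman rho vs. human ratings: {rho:.3f}")
```

Because the synthetic scores are generated from the features themselves, the fitted head recovers them almost exactly; with real CLAP features and expert ratings, the same validation procedure yields the ρ > 0.82 agreement reported above.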

📝 Abstract
The technology for generating music from textual descriptions has seen rapid advancements. However, evaluating text-to-music (TTM) systems remains a significant challenge, primarily due to the difficulty of balancing performance and cost with existing objective and subjective evaluation methods. In this paper, we propose an automatic assessment task for TTM models to align with human perception. To address the TTM evaluation challenges posed by the professional requirements of music evaluation and the complexity of the relationship between text and music, we collect MusicEval, the first generative music assessment dataset. This dataset contains 2,748 music clips generated by 31 advanced and widely used models in response to 384 text prompts, along with 13,740 ratings from 14 music experts. Furthermore, we design a CLAP-based assessment model built on this dataset, and our experimental results validate the feasibility of the proposed task, providing a valuable reference for future development in TTM evaluation. The dataset is available at https://www.aishelltech.com/AISHELL_7A.
Problem

Research questions and friction points this paper is trying to address.

- Automatic Evaluation
- Text-to-Music
- Cost-Effectiveness

Innovation

Methods, ideas, or system contributions that make the work stand out.

- MusicEval Dataset
- Automatic Evaluation Method
- Text-to-Music Generation
Authors

Cheng Liu
TMCC, College of Computer Science, Nankai University, Tianjin, China

Hui Wang
TMCC, College of Computer Science, Nankai University, Tianjin, China

Jinghua Zhao
Nankai University

Shiwan Zhao
Independent Researcher; Research Scientist at IBM Research - China (2000-2020)
Interests: AGI, Large Language Models, NLP, Speech, Recommender Systems

Hui Bu
AISHELL
Interests: Speech recognition, speech databases and text corpora, special topics on speech databases and

Xin Xu
Beijing AISHELL Technology Co., Ltd.

Jiaming Zhou
TMCC, College of Computer Science, Nankai University, Tianjin, China

Haoqin Sun
Nankai University
Interests: Affective computing, speech signal processing, audio understanding

Yong Qin
TMCC, College of Computer Science, Nankai University, Tianjin, China