AImoclips: A Benchmark for Evaluating Emotion Conveyance in Text-to-Music Generation

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-music (TTM) systems lack systematic, human-perception–based evaluation of emotional expressiveness. Method: We introduce AImoclips, the first benchmark dedicated to emotional fidelity in TTM, covering 12 emotion categories spanning the valence–arousal quadrants. It comprises over 1,000 audio samples generated by six state-of-the-art TTM models and evaluated via large-scale subjective testing with 111 participants using a 9-point Likert scale. Contribution/Results: Our study reveals a pervasive “emotional neutralization” bias across TTM models—i.e., a tendency to generate emotionally muted outputs regardless of input intent. We further find that high-arousal emotions are more reliably recognized than low-arousal ones; commercial models systematically favor high-valence (pleasurable) outputs, whereas open-source models exhibit the opposite tendency. This work establishes a reproducible, perception-grounded evaluation framework and provides the first empirical evidence characterizing emotional controllability in TTM generation.
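To make the "emotional neutralization" finding concrete, here is a minimal analysis sketch. It is purely illustrative: it assumes a per-rater table of 9-point valence and arousal scores and measures how much perceived ratings are pulled toward the scale midpoint relative to the intended emotion. The column names, toy data, and this particular bias definition are assumptions, not taken from the paper.

```python
# Hypothetical sketch: quantifying "emotional neutralization" from listener ratings.
# Assumes one row per (clip, rater) with 9-point valence/arousal scores; column
# names and the bias definition below are illustrative, not the paper's method.
import pandas as pd

MIDPOINT = 5.0  # neutral point of a 9-point Likert scale

# Toy ratings: intended valence/arousal per clip plus perceived scores per rater.
ratings = pd.DataFrame({
    "clip_id":           ["a", "a", "b", "b", "c", "c"],
    "intended_valence":  [9, 9, 1, 1, 9, 9],
    "intended_arousal":  [9, 9, 9, 9, 1, 1],
    "perceived_valence": [6, 7, 4, 3, 6, 5],
    "perceived_arousal": [7, 8, 6, 7, 4, 5],
})

# Average listener ratings per clip.
per_clip = ratings.groupby("clip_id").mean(numeric_only=True)

# One possible neutralization measure: how much closer the perceived emotion sits
# to the scale midpoint than the intended emotion does, per dimension.
for dim in ("valence", "arousal"):
    intended_dist = (per_clip[f"intended_{dim}"] - MIDPOINT).abs()
    perceived_dist = (per_clip[f"perceived_{dim}"] - MIDPOINT).abs()
    per_clip[f"{dim}_neutralization"] = intended_dist - perceived_dist

print(per_clip[["valence_neutralization", "arousal_neutralization"]])
# Positive values indicate ratings pulled toward neutrality relative to the intent.
```

Under this toy definition, clips whose perceived emotion is more neutral than intended yield positive values, which is the qualitative pattern the benchmark reports across all six systems.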

📝 Abstract
Recent advances in text-to-music (TTM) generation have enabled controllable and expressive music creation using natural language prompts. However, the emotional fidelity of TTM systems remains largely underexplored compared to human preference or text alignment. In this study, we introduce AImoclips, a benchmark for evaluating how well TTM systems convey intended emotions to human listeners, covering both open-source and commercial models. We selected 12 emotion intents spanning four quadrants of the valence-arousal space, and used six state-of-the-art TTM systems to generate over 1,000 music clips. A total of 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale. Our results show that commercial systems tend to produce music perceived as more pleasant than intended, while open-source systems show the opposite tendency. Emotions are more accurately conveyed under high-arousal conditions across all models. Additionally, all systems exhibit a bias toward emotional neutrality, highlighting a key limitation in affective controllability. This benchmark offers valuable insights into model-specific emotion rendering characteristics and supports future development of emotionally aligned TTM systems.
Problem

Research questions and friction points this paper is trying to address.

Evaluating emotion conveyance in text-to-music generation systems
Assessing emotional fidelity as perceived by human listeners, rather than only human preference or text alignment
Identifying biases toward emotional neutrality in generated music
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for evaluating emotion conveyance in TTM
Selection of 12 emotion intents spanning the four quadrants of the valence-arousal space (illustrated in the sketch after this list)
Large-scale listening test in which 111 participants rated the perceived valence and arousal of each clip on a 9-point Likert scale
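As a companion to the valence-arousal framing above, the sketch below shows one way an emotion intent could be assigned to a quadrant of the valence-arousal plane. The emotion labels and numeric coordinates are generic placeholders; the paper's actual 12 categories are not listed in this summary.

```python
# Illustrative sketch: placing emotion intents in the valence-arousal plane.
# The quadrant split follows the summary's framing; the example emotion labels
# and coordinates are placeholders, not the benchmark's actual 12 categories.

def quadrant(valence: float, arousal: float, midpoint: float = 5.0) -> str:
    """Return the valence-arousal quadrant for a point on a 9-point scale."""
    v = "high-valence" if valence >= midpoint else "low-valence"
    a = "high-arousal" if arousal >= midpoint else "low-arousal"
    return f"{v}/{a}"

# Hypothetical intent coordinates (valence, arousal) on the 1-9 scale.
example_intents = {
    "joyful": (8.0, 7.5),
    "serene": (7.5, 2.5),
    "angry":  (2.0, 8.0),
    "gloomy": (2.5, 2.0),
}

for emotion, (v, a) in example_intents.items():
    print(f"{emotion:>7}: {quadrant(v, a)}")
```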