🤖 AI Summary
In text-to-audio (TTA) generation, evaluating text–audio relevance has long relied on costly human assessments or on objective metrics of questionable validity (e.g., CLAPScore). To address this, we introduce RELATE, the first open-source, human-annotated subjective evaluation dataset for TTA relevance assessment, covering diverse acoustic categories and providing fine-grained relevance scores. Leveraging RELATE, we train an end-to-end deep learning model to predict human relevance judgments automatically. Experiments demonstrate that our model significantly outperforms CLAPScore across all sound categories, achieves consistently high performance, and exhibits strong agreement with human raters (Spearman ρ > 0.72). This work establishes the first standardized subjective benchmark for TTA relevance evaluation and provides a reliable, automated assessment tool, thereby filling two critical gaps in the field.
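For reference, CLAPScore is commonly computed as the cosine similarity between CLAP embeddings of the caption and the generated audio. Below is a minimal sketch using the `laion_clap` package; the checkpoint choice and file paths are illustrative assumptions, not the paper's setup:

```python
# Minimal CLAPScore sketch: cosine similarity between CLAP text and
# audio embeddings. Checkpoint and preprocessing are assumptions here,
# not necessarily what the paper used.
import numpy as np
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # loads the default pretrained checkpoint

def clap_score(audio_path: str, caption: str) -> float:
    """Cosine similarity between one audio file and one caption."""
    a = model.get_audio_embedding_from_filelist(x=[audio_path], use_tensor=False)[0]
    t = model.get_text_embedding([caption], use_tensor=False)[0]
    return float(np.dot(a, t) / (np.linalg.norm(a) * np.linalg.norm(t)))

# Hypothetical usage on a generated sample:
print(clap_score("generated.wav", "a dog barking in the distance"))
```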
📝 Abstract
In text-to-audio (TTA) research, the relevance between the input text and the output audio is an important evaluation aspect. Traditionally, it has been assessed both subjectively and objectively. However, subjective evaluation is costly in money and time, while the correlation of objective metrics with subjective evaluation scores remains unclear. In this study, we construct RELATE, an open-source dataset of subjective relevance evaluations. We also benchmark a model that automatically predicts the subjective evaluation score from synthesized audio. Our model outperforms the conventional CLAPScore model, and this trend extends across many sound categories.
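To illustrate how agreement with human raters might be measured, the sketch below computes Spearman ρ between a model's predicted relevance scores and human ratings using `scipy.stats.spearmanr`; the arrays are placeholder values, not data from RELATE:

```python
# Hypothetical agreement check: Spearman rank correlation between
# predicted relevance scores and human ratings. Placeholder data only.
from scipy.stats import spearmanr

human_scores = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]        # e.g., mean opinion scores
model_scores = [0.82, 0.31, 0.64, 0.91, 0.22, 0.77]  # predicted relevance

rho, p_value = spearmanr(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```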