AI Summary
Existing evaluation metrics for text-to-audio generation rely predominantly on embedding similarity, which struggles to capture fine-grained semantic alignment and compositional reasoning, and correlates only weakly with human judgments. To address this, this work proposes AQAScore, an architecture-agnostic evaluation framework that, for the first time in this domain, casts evaluation as audio question answering. AQAScore leverages audio-aware large language models (ALLMs) to perform probabilistic semantic verification, computing the log-probability of a "Yes" response to targeted semantic queries such as "Does the audio contain the content described in the text?" Experiments demonstrate that AQAScore significantly outperforms similarity-based metrics such as CLAPScore and generative prompting baselines across multiple benchmarks, achieving high agreement with human ratings, supporting compositional reasoning evaluation, and scaling with the capability of the underlying ALLM.
Abstract
Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity such as CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task; rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, demonstrating its effectiveness in capturing subtle semantic inconsistencies and its ability to scale with the capability of the underlying ALLM.
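The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `answer_logits` is a hypothetical stand-in for the next-token logits a real ALLM would produce for the candidate answer tokens after being prompted with the audio and the semantic query.

```python
import math

def aqa_score(answer_logits: dict) -> float:
    """Return log P("Yes") under a softmax over the candidate answer tokens.

    A score closer to 0 indicates stronger audio-text alignment; more
    negative scores indicate weaker alignment.
    """
    # log of the softmax normalizer over the answer vocabulary
    log_z = math.log(sum(math.exp(v) for v in answer_logits.values()))
    return answer_logits["Yes"] - log_z

# Hypothetical example: the model strongly favors "Yes" for a
# well-aligned audio/text pair, so the score is close to 0.
logits = {"Yes": 4.2, "No": 0.3}
score = aqa_score(logits)
```

Using the exact log-probability of a single answer token, rather than parsing free-form generated text, makes the metric deterministic and directly comparable across prompts and models.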