AI Summary
This work addresses semantic inconsistency in text-to-audio generation, a problem compounded by the field's reliance on slow human listening tests and the lack of objective, automated evaluation metrics. The paper proposes an assessment model that integrates a Mixture-of-Experts (MoE) architecture with a sequential cross-attention mechanism (SeqCoAttn), marking the first application of this combined approach to modeling semantic alignment between text and audio. On the XACLE challenge test set, the proposed method achieves a Spearman's rank correlation coefficient (SRCC) of 0.6402, a 30.6% improvement over the baseline, outperforming all existing methods and securing first place. This advancement establishes an efficient, objective paradigm for evaluating text-to-audio generation quality.
Abstract
Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA therefore remains a significant challenge: traditional methods rely primarily on subjective human listening tests, which are time-consuming. To address this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model ranks first in the XACLE Challenge, with an SRCC of 0.6402 on the test dataset (a 30.6% improvement over the challenge baseline). Code is available at: https://github.com/S-Orion/MOESCORE.
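To make the two named components concrete, the sketch below shows one plausible reading of the architecture: text-token embeddings attend to audio-frame embeddings, the result is attended to again in sequence (a sequential cross-attention pattern), and a Mixture-of-Experts head turns the pooled joint representation into a scalar alignment score. This is a minimal NumPy illustration under assumed shapes and randomly initialized weights; all names (`cross_attention`, `W_gate`, the expert scorers) are hypothetical and this is not the paper's actual implementation, which is available at the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # scaled dot-product attention: each query row attends over the
    # key/value sequence and returns a weighted sum of its rows
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv

d = 16                            # shared embedding dimension (assumed)
text = rng.normal(size=(5, d))    # 5 text-token embeddings (stand-ins)
audio = rng.normal(size=(8, d))   # 8 audio-frame embeddings (stand-ins)

# Sequential cross-attention: text attends to audio first, then the
# fused text-side representation is attended to by the audio side.
t2a = cross_attention(text, audio, d)   # (5, d) text enriched by audio
a2t = cross_attention(audio, t2a, d)    # (8, d) audio enriched in turn
fused = a2t.mean(axis=0)                # (d,) pooled joint representation

# Mixture of Experts: a gating network produces per-expert weights,
# each expert is a simple linear scorer, and the final score is the
# gate-weighted combination of expert scores.
n_experts = 4
W_gate = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d))

gate = softmax(fused @ W_gate)          # (n_experts,), sums to 1
expert_scores = experts @ fused         # one scalar score per expert
score = float(gate @ expert_scores)     # final semantic-alignment score
```

In a trained evaluator the random matrices above would be learned parameters, and `score` would be regressed against human ratings so that a rank correlation such as SRCC can be computed over a test set.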