MoEScore: Mixture-of-Experts-Based Text-Audio Relevance Score Prediction for Text-to-Audio System Evaluation

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the evaluation of semantic consistency in text-to-audio generation, which currently relies on time-consuming human listening tests and lacks objective, automated metrics. To this end, the paper proposes an assessment model that integrates a Mixture-of-Experts (MoE) architecture with a sequential cross-attention mechanism (SeqCoAttn), marking the first application of this combined approach to modeling semantic alignment between text and audio. Evaluated on the XACLE challenge test set, the proposed method achieves a Spearman's rank correlation coefficient (SRCC) of 0.6402, a 30.6% improvement over the challenge baseline, and ranks first in the challenge. The result offers an efficient, objective way to evaluate the semantic fidelity of text-to-audio generation.

📝 Abstract
Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods primarily rely on subjective human listening tests, which are time-consuming. To address this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model ranks first in the XACLE Challenge, with an SRCC of 0.6402 (an improvement of 30.6% over the challenge baseline) on the test dataset. Code is available at: https://github.com/S-Orion/MOESCORE.
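
The headline metric, SRCC, measures how well a model's predicted relevance scores preserve the ranking of human ratings. The following is a minimal sketch of that computation, not the paper's or the challenge's evaluation code: the function names (`average_ranks`, `pearson`, `srcc`) and the five example scores are hypothetical and chosen only for illustration. It computes the Pearson correlation of tie-averaged ranks, which is the standard definition of Spearman's coefficient.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(predicted, human):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(human))


# Hypothetical scores for five text-audio pairs (not taken from the paper).
human_scores = [2.0, 7.5, 5.0, 9.0, 4.0]
model_scores = [1.8, 6.9, 5.5, 8.7, 6.0]
print(round(srcc(model_scores, human_scores), 4))  # prints 0.9
```

Because only the ordering matters, an evaluator can score well on SRCC even if its outputs are on a different scale from the human ratings, as long as it ranks the text-audio pairs similarly.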
Problem

Research questions and friction points this paper is trying to address.

Text-to-Audio
semantic consistency
audio-text alignment
system evaluation
generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts
Text-to-Audio
Semantic Fidelity
Sequential Cross-Attention
Objective Evaluation