AI Summary
This work addresses semantic inconsistency in text-to-audio generation, a problem compounded by the field's reliance on slow human listening tests and the lack of objective, automated evaluation metrics. The paper proposes an assessment model that integrates a Mixture-of-Experts (MoE) architecture with a sequential cross-attention mechanism (SeqCoAttn), marking the first application of this combined approach to modeling semantic alignment between text and audio. On the XACLE challenge test set, the proposed method achieves a Spearman's rank correlation coefficient (SRCC) of 0.6402, a 30.6% improvement over the baseline, outperforming all existing methods and securing first place. This advancement establishes an efficient, objective paradigm for evaluating text-to-audio generation quality.
Abstract
Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA therefore remains a significant challenge: traditional methods rely primarily on subjective human listening tests, which are time-consuming. To address this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model ranks first in the XACLE Challenge, with an SRCC of 0.6402 on the test dataset (a 30.6% improvement over the challenge baseline). Code is available at: https://github.com/S-Orion/MOESCORE.
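To make the two named components concrete, the sketch below shows one plausible reading of the architecture: text-token embeddings attend to audio-frame embeddings, the result is attended to again in sequence (a sequential cross-attention pattern), and a Mixture-of-Experts head turns the pooled joint representation into a scalar alignment score. This is a minimal NumPy illustration under assumed shapes and randomly initialized weights; all names (`cross_attention`, `W_gate`, the expert scorers) are hypothetical and this is not the paper's actual implementation, which is available at the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # scaled dot-product attention: each query row attends over the
    # key/value sequence and returns a weighted sum of its rows
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv

d = 16                            # shared embedding dimension (assumed)
text = rng.normal(size=(5, d))    # 5 text-token embeddings (stand-ins)
audio = rng.normal(size=(8, d))   # 8 audio-frame embeddings (stand-ins)

# Sequential cross-attention: text attends to audio first, then the
# fused text-side representation is attended to by the audio side.
t2a = cross_attention(text, audio, d)   # (5, d) text enriched by audio
a2t = cross_attention(audio, t2a, d)    # (8, d) audio enriched in turn
fused = a2t.mean(axis=0)                # (d,) pooled joint representation

# Mixture of Experts: a gating network produces per-expert weights,
# each expert is a simple linear scorer, and the final score is the
# gate-weighted combination of expert scores.
n_experts = 4
W_gate = rng.normal(size=(d, n_experts))
experts = rng.normal(size=(n_experts, d))

gate = softmax(fused @ W_gate)          # (n_experts,), sums to 1
expert_scores = experts @ fused         # one scalar score per expert
score = float(gate @ expert_scores)     # final semantic-alignment score
```

In a trained evaluator the random matrices above would be learned parameters, and `score` would be regressed against human ratings so that a rank correlation such as SRCC can be computed over a test set.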