The AudioMOS Challenge 2025

📅 2025-09-01
🤖 AI Summary
This paper summarizes the AudioMOS Challenge 2025, the first challenge on automatic prediction of subjective quality ratings for synthetic audio, with three tracks spanning text-to-speech, text-to-audio, and text-to-music generation. Track 1 scores text-to-music samples on two dimensions: overall quality and text-audio alignment. Track 2 adopts the four evaluation axes of Meta Audiobox Aesthetics (production quality, production complexity, content enjoyment, and content usefulness), with a test set of text-to-speech, text-to-audio, and text-to-music samples. Track 3 assesses synthetic speech quality at different sampling rates. The challenge attracted 24 unique teams from academia and industry, and improvements over the baselines were confirmed. The results are expected to support progress in automatic evaluation of audio generation systems.

📝 Abstract
This is the summary paper for the AudioMOS Challenge 2025, the very first challenge for automatic subjective quality prediction for synthetic audio. The challenge consists of three tracks. The first track aims to assess text-to-music samples in terms of overall quality and textual alignment. The second track is based on the four evaluation dimensions of Meta Audiobox Aesthetics, and the test set consists of text-to-speech, text-to-audio, and text-to-music samples. The third track focuses on synthetic speech quality assessment at different sampling rates. The challenge attracted 24 unique teams from both academia and industry, and improvements over the baselines were confirmed. The outcome of this challenge is expected to facilitate development and progress in the field of automatic evaluation for audio generation systems.
Problem

Research questions and friction points this paper is trying to address.

Assessing text-to-music quality and textual alignment
Evaluating text-to-speech, text-to-audio, and text-to-music samples along the four Meta Audiobox Aesthetics dimensions
Measuring synthetic speech quality at different sampling rates
Innovation

Methods, ideas, or system contributions that make the work stand out.

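First challenge on automatic subjective quality prediction for synthetic audio
Text-to-music assessment covering both overall quality and textual alignment
Synthetic speech quality assessment at different sampling rates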