The AudioMOS Challenge 2025

📅 2025-09-01

🤖 AI Summary

This summary paper presents the AudioMOS Challenge 2025, which its authors describe as the first challenge on automatic subjective quality prediction for synthetic audio. The challenge has three tracks: (1) text-to-music samples are assessed on overall quality and textual alignment; (2) text-to-speech, text-to-audio, and text-to-music samples are assessed along the four evaluation dimensions of Meta Audiobox Aesthetics (production quality, production complexity, content enjoyment, and content usefulness); (3) synthetic speech quality is assessed at different sampling rates. The challenge attracted 24 unique teams from academia and industry, and improvements over the baselines were confirmed. The authors expect the outcome to support progress in automatic evaluation of audio generation systems.

📝 Abstract

This is the summary paper for the AudioMOS Challenge 2025, the very first challenge for automatic subjective quality prediction for synthetic audio. The challenge consists of three tracks. The first track aims to assess text-to-music samples in terms of overall quality and textual alignment. The second track is based on the four evaluation dimensions of Meta Audiobox Aesthetics, and the test set consists of text-to-speech, text-to-audio, and text-to-music samples. The third track focuses on synthetic speech quality assessment in different sampling rates. The challenge attracted 24 unique teams from both academia and industry, and improvements over the baselines were confirmed. The outcome of this challenge is expected to facilitate development and progress in the field of automatic evaluation for audio generation systems.

Problem

Research questions and friction points this paper is trying to address.

Assessing text-to-music quality and textual alignment

Evaluating synthetic audio across multiple generation dimensions

Measuring synthetic speech quality at different sampling rates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic subjective quality prediction

Text-to-music quality assessment

Synthetic speech quality evaluation
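
Automatic subjective quality prediction means a model outputs a score for each audio sample, and that score is compared against human listener ratings. The sketch below illustrates one way such a comparison could be made, using mean squared error and Spearman rank correlation. These two metrics are an assumption for illustration only: the abstract does not state which metrics the challenge uses, and the scores in the example are made up.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mse(pred, target):
    """Mean squared error between predicted and reference scores."""
    return mean((p - t) ** 2 for p, t in zip(pred, target))


# Hypothetical data (not from the challenge): human mean opinion scores
# and the scores an automatic predictor assigned to the same five samples.
human = [4.2, 3.1, 2.5, 3.8, 1.9]
predicted = [4.0, 3.3, 2.9, 3.5, 2.2]

print("MSE:", mse(predicted, human))
print("Spearman:", spearman(human, predicted))
```

Rank correlation rewards a predictor for ordering samples the way listeners do, even when its absolute scores are offset, while MSE penalizes the offset itself; reporting both gives a fuller picture than either alone.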
