Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation

📅 2024-11-19
🏛️ arXiv.org
📈 Citations: 1 · Influential: 0
🤖 AI Summary
Existing TTS human evaluation methods suffer from three key limitations: MOS lacks discriminative power; CMOS is inefficient for large-scale assessment; and MUSHRA systematically penalizes superhuman-quality synthesis due to its mandatory anchoring to human reference recordings, while also being vulnerable to rater variability, auditory fatigue, and reference bias. To address these issues, we conduct a large-scale subjective study with 492 Hindi and Tamil listeners and propose two innovations: (1) a reference-decoupled MUSHRA variant that eliminates unjustified penalties against synthesis exceeding human quality; and (2) a structured rating guideline that substantially reduces inter-rater score variance. Concurrently, we introduce MANGO—the first high-quality, multilingual TTS subjective evaluation benchmark—comprising 246,000 human ratings on Indian language speech. MANGO enables fairer, finer-grained, and more reliable TTS assessment and facilitates robust automatic metric development.

📝 Abstract
Despite rapid advancements in TTS models, a consistent and robust human evaluation framework is still lacking. For example, MOS tests fail to differentiate between similar models, and CMOS's pairwise comparisons are time-intensive. The MUSHRA test is a promising alternative for evaluating multiple TTS systems simultaneously, but in this work we show that its reliance on matching human reference speech unduly penalises the scores of modern TTS systems that can exceed human speech quality. More specifically, we conduct a comprehensive assessment of the MUSHRA test, focusing on its sensitivity to factors such as rater variability, listener fatigue, and reference bias. Based on our extensive evaluation involving 492 human listeners across Hindi and Tamil we identify two primary shortcomings: (i) reference-matching bias, where raters are unduly influenced by the human reference, and (ii) judgement ambiguity, arising from a lack of clear fine-grained guidelines. To address these issues, we propose two refined variants of the MUSHRA test. The first variant enables fairer ratings for synthesized samples that surpass human reference quality. The second variant reduces ambiguity, as indicated by the relatively lower variance across raters. By combining these approaches, we achieve both more reliable and more fine-grained assessments. We also release MANGO, a massive dataset of 246,000 human ratings, the first-of-its-kind collection for Indian languages, aiding in analyzing human preferences and developing automatic metrics for evaluating TTS systems.
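
The reduced rater variance the authors report suggests a simple sanity check that applies to any MUSHRA-style study: aggregate per-system scores and look at the spread across raters. Below is a minimal sketch of that check; the flat (system, rater, score) layout and the toy values are illustrative assumptions, not the paper's actual pipeline or the MANGO schema.

```python
# Minimal sketch: aggregate MUSHRA-style ratings per system and measure
# rater agreement. The (system, rater, score) schema is hypothetical,
# not the MANGO dataset's actual format.
from collections import defaultdict
from statistics import mean, stdev

ratings = [
    # (system, rater, score on the 0-100 MUSHRA scale) -- toy values
    ("tts_a", "r1", 78), ("tts_a", "r2", 85), ("tts_a", "r3", 71),
    ("tts_b", "r1", 64), ("tts_b", "r2", 90), ("tts_b", "r3", 55),
]

by_system = defaultdict(list)
for system, _rater, score in ratings:
    by_system[system].append(score)

for system, scores in by_system.items():
    # A large across-rater stdev is the "judgement ambiguity" symptom
    # that the paper's structured rating guideline aims to reduce.
    print(f"{system}: mean={mean(scores):.1f}, rater stdev={stdev(scores):.1f}")
```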
Problem

Research questions and friction points this paper is trying to address.

Lack of robust human evaluation framework for TTS models
MUSHRA test penalizes modern TTS systems exceeding human quality
Reference bias and judgement ambiguity in TTS evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Refined MUSHRA test variants for fairer ratings
Addressed reference bias and judgement ambiguity issues
Released MANGO dataset of human ratings for developing automatic TTS evaluation metrics (see the sketch after this list)
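
One intended use of a ratings corpus like MANGO is benchmarking automatic TTS metrics against human judgements. The usual recipe is system-level rank correlation between metric scores and mean human ratings; the sketch below shows that recipe with made-up per-system numbers, so the values and the assumption that MANGO provides per-system mean scores are illustrative only.

```python
# Sketch: validate an automatic TTS metric against human ratings by
# checking system-level rank agreement. All data values are illustrative.
from scipy.stats import spearmanr

# Hypothetical per-system mean human MUSHRA scores (e.g., derived from MANGO).
human_means = {"tts_a": 78.0, "tts_b": 69.5, "tts_c": 83.2, "ref": 88.0}

# Hypothetical scores from some automatic metric on the same systems.
metric_scores = {"tts_a": 0.71, "tts_b": 0.64, "tts_c": 0.80, "ref": 0.86}

systems = sorted(human_means)
rho, pval = spearmanr(
    [human_means[s] for s in systems],
    [metric_scores[s] for s in systems],
)
# A metric that ranks systems the way human raters do yields rho near 1.
print(f"system-level Spearman rho={rho:.2f} (p={pval:.3f})")
```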
👥 Authors

Praveena Varadhan
AI4Bharat, Indian Institute of Technology Madras

Amogh Gulati
Gan.AI

Ashwin Sankar
MS by Research @ IIT Madras, AI4Bharat
Speech Synthesis · Speech Translation · Multi-modal AI · TTS · LLM

Srija Anand
MS by Research, AI4Bharat, IIT Madras
Speech Synthesis · Natural Language Processing · LLM Evaluation

Anirudh Gupta
Gan.AI

Anirudh Mukherjee
Gan.AI

Shiva Kumar Marepally
AI4Bharat, Indian Institute of Technology Madras

Ankur Bhatia
Gan.AI

Saloni Jaju
Gan.AI

Suvrat Bhooshan
Gan.AI, ex Facebook AI Research (FAIR)
Deep Learning · Computer Vision · Medical Imaging

Mitesh M. Khapra
AI4Bharat, Indian Institute of Technology Madras