🤖 AI Summary
Evaluating audio generation quality is challenging because human perception is subjective and multidimensional, and existing regression-based Mean Opinion Score (MOS) prediction methods overlook the relative nature of human judgments. This paper proposes QAMRO, a Quality-aware Adaptive Margin Ranking Optimization framework that reformulates MOS prediction as a pairwise ranking task with quality-aware weighting. It leverages features from multimodal pretrained models, including CLAP and Audiobox-Aesthetics, in an end-to-end trainable architecture. Key innovations include: (i) an adaptive margin mechanism that dynamically models perceptual dissimilarity across sample pairs; and (ii) a quality-aware weighting strategy that emphasizes high-confidence preference alignments. Evaluated on the AudioMOS Challenge 2025 dataset, the method significantly outperforms strong baselines, achieving state-of-the-art Pearson and Spearman correlation coefficients (>0.92) and Top-1 accuracy, which indicates close agreement with human ratings.
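To make the two ideas above concrete, here is a minimal sketch of a pairwise hinge loss whose margin scales with the MOS gap and whose pairs are re-weighted by quality. It is an illustration under stated assumptions, not the paper's formulation: the function name `adaptive_margin_ranking_loss`, the linear margin `alpha * |gap|`, the weight `1 + beta * mos / mos_max`, and all default values are hypothetical choices that the summary does not specify.

```python
from itertools import combinations


def adaptive_margin_ranking_loss(preds, mos, alpha=0.5, beta=1.0, mos_max=5.0):
    """Weighted pairwise hinge loss over all sample pairs.

    preds: predicted scores, one per audio sample.
    mos:   ground-truth mean opinion scores, same length as preds.
    alpha, beta, mos_max: illustrative hyperparameters (assumptions).
    """
    total, weight_sum = 0.0, 0.0
    for i, j in combinations(range(len(preds)), 2):
        gap = mos[i] - mos[j]
        if gap == 0:
            continue  # tied ratings express no preference
        # Orient the pair so that `hi` is the sample humans rated higher.
        hi, lo = (i, j) if gap > 0 else (j, i)
        # Adaptive margin (assumed linear): larger MOS gaps demand a
        # larger separation between the predicted scores.
        margin = alpha * abs(gap)
        # Quality-aware weight (assumed form): pairs whose preferred
        # sample has a higher rating count more.
        weight = 1.0 + beta * mos[hi] / mos_max
        # Hinge penalty: zero once the predicted gap reaches the margin.
        hinge = max(0.0, margin - (preds[hi] - preds[lo]))
        total += weight * hinge
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Correctly ordered predictions incur no penalty; reversed ones do.
print(adaptive_margin_ranking_loss([4.0, 2.0], [4.0, 2.0]))
print(adaptive_margin_ranking_loss([2.0, 4.0], [4.0, 2.0]))
```

A real training setup would express the same computation in an autodiff framework and would likely combine it with a regression term, since the abstract describes integrating regression objectives; how those terms are balanced is not stated here.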
📝 Abstract
Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.