QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems

📅 2025-08-12

🤖 AI Summary
Evaluating audio generation quality is challenging because human perception is subjective and multidimensional. Existing regression-based Mean Opinion Score (MOS) prediction methods overlook the relative nature of human judgments. This paper proposes a Quality-aware Adaptive Margin Ranking Optimization (QAMRO) framework that reformulates MOS prediction as a pairwise ranking task with quality-aware weighting, using features from pretrained audio-text models, including CLAP and Audiobox-Aesthetics, in an end-to-end trainable architecture. Key innovations include: (i) an adaptive margin mechanism that dynamically models perceptual dissimilarity across sample pairs; and (ii) a quality-aware weighting strategy that emphasizes high-confidence preference alignments. Evaluated on the AudioMOS Challenge 2025 dataset, the method significantly outperforms strong baselines, achieving state-of-the-art Pearson and Spearman correlation coefficients (>0.92) and Top-1 accuracy, which indicates close agreement with human ratings.

📝 Abstract
Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
Problem

Research questions and friction points this paper is trying to address.

Address subjectivity in audio generation evaluation
Improve mean opinion score prediction accuracy
Align assessment with the relative nature of human perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality-aware Adaptive Margin Ranking Optimization
Integrates regression objectives from different perspectives
Leverages pre-trained audio-text models