QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating audio generation quality is challenging because human perception is subjective and multidimensional, and existing regression-based Mean Opinion Score (MOS) prediction methods overlook the relative nature of human judgments. This paper proposes a Quality-aware Adaptive Margin Ranking Optimization (QAMRO) framework that reformulates MOS prediction as a pairwise ranking task with quality-aware weighting, leveraging features from multimodal pretrained models, including CLAP and Audiobox-Aesthetics, in an end-to-end trainable architecture. Key innovations include (i) an adaptive margin mechanism that dynamically models perceptual dissimilarity across sample pairs and (ii) a quality-aware weighting strategy that emphasizes high-confidence preference alignments. Evaluated on the AudioMOS Challenge 2025 dataset, the method significantly outperforms strong baselines, achieving state-of-the-art Pearson and Spearman correlation coefficients (>0.92) and Top-1 accuracy, which indicates close agreement with human ratings.
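The summary reports agreement with human ratings as Pearson and Spearman correlation. As a rough illustration of what those two numbers measure, here is a minimal pure-Python sketch. The function names (`pearson`, `average_ranks`, `spearman`) and the sample scores are assumptions made for this example, not the challenge's official scoring code.

```python
import math


def pearson(x, y):
    """Linear correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human MOS values and model predictions for four clips.
human_mos = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.5, 3.4]
print(round(pearson(human_mos, predicted), 3), round(spearman(human_mos, predicted), 3))
```

Pearson rewards predictions that track the human scores linearly, while Spearman only checks that the ordering is preserved, which is why ranking-oriented training objectives are usually judged on both.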

📝 Abstract
Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
Problem

Research questions and friction points this paper is trying to address.

Address subjectivity in audio generation evaluation
Improve mean opinion score prediction accuracy
Align assessment with human perception relativity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality-aware Adaptive Margin Ranking Optimization
Integrates regression objectives from different perspectives
Leverages pre-trained audio-text models
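This page does not give the exact loss, so the following is only a sketch of how a regression term could be combined with an adaptive-margin, quality-weighted pairwise ranking term. Everything specific here is an assumption for illustration: the function name `adaptive_margin_ranking_loss`, the margin being proportional to the MOS gap, the weight growing with the higher-rated clip's MOS, and all default hyperparameters.

```python
def adaptive_margin_ranking_loss(preds, mos, margin_scale=0.5,
                                 weight_scale=1.0, rank_coef=1.0, mos_max=5.0):
    """Illustrative loss: L1 regression plus a weighted pairwise hinge term.

    Assumed design (not taken from the paper):
      * margin for a pair = margin_scale * (MOS gap), so clips that humans
        rate further apart must also be predicted further apart;
      * weight for a pair = 1 + weight_scale * (higher MOS / mos_max), so
        pairs involving higher-rated clips count more.
    """
    n = len(preds)
    # Regression term: mean absolute error against the human MOS.
    regression = sum(abs(p - y) for p, y in zip(preds, mos)) / n

    ranking_total, num_pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if mos[i] <= mos[j]:
                continue  # keep only ordered pairs where clip i is rated higher
            gap = mos[i] - mos[j]
            margin = margin_scale * gap
            weight = 1.0 + weight_scale * (mos[i] / mos_max)
            # Hinge: penalize when the predicted gap falls short of the margin.
            ranking_total += weight * max(0.0, margin - (preds[i] - preds[j]))
            num_pairs += 1

    ranking = ranking_total / num_pairs if num_pairs else 0.0
    return regression + rank_coef * ranking


# Hypothetical batch: the second call swaps the predicted order of two clips.
print(adaptive_margin_ranking_loss([1.0, 3.0, 5.0], [1.0, 3.0, 5.0]))
print(adaptive_margin_ranking_loss([3.0, 1.0], [1.0, 3.0]))
```

Predictions that match the human scores incur no penalty, while a pair predicted in the wrong order is penalized by both terms. A real system would compute this on batched tensors in a deep learning framework so that gradients flow back into the feature extractor.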
Chien-Chun Wang
National Taiwan Normal University
Speech Enhancement, Speech Recognition, Voice Activity Detection, Speech Quality Assessment

Kuan-Tang Huang
Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan

Cheng-Yeh Yang
Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan

Hung-Shin Lee
North Co., Ltd., Taiwan
Speech Processing

Hsin-Min Wang
Research Fellow/Professor, Institute of Information Science, Academia Sinica
Spoken Language Processing, Natural Language Processing, Multimedia Information Retrieval, Machine Learning

Berlin Chen
Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan