QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems

📅 2025-08-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluating audio generation quality is challenging because human perception is subjective and multidimensional, and existing regression-based Mean Opinion Score (MOS) prediction methods overlook the relative nature of human judgments. This paper proposes a Quality-aware Adaptive Margin Ranking Optimization framework that reformulates MOS prediction as a pairwise ranking task with quality-aware weighting, using features from multimodal pretrained models, including CLAP and Audiobox-Aesthetics, in an end-to-end trainable architecture. Key innovations include: (i) an adaptive margin mechanism that dynamically models perceptual dissimilarity across sample pairs; and (ii) a quality-aware weighting strategy that emphasizes high-confidence preference alignments. Evaluated on the AudioMOS Challenge 2025 dataset, the method significantly outperforms strong baselines, achieving state-of-the-art Pearson and Spearman correlation coefficients (>0.92) and Top-1 accuracy, which indicates close agreement with human ratings.
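The two mechanisms named above can be illustrated with a minimal sketch of a pairwise hinge loss. It is not the paper's formulation: the linear margin `margin_scale * |gap|`, the weight `1 + weight_scale * |gap|`, and the function and parameter names are all assumptions chosen for illustration.

```python
from itertools import combinations


def adaptive_margin_ranking_loss(preds, targets, margin_scale=0.5, weight_scale=1.0):
    """Hypothetical pairwise ranking loss; margin and weight forms are assumptions."""
    total = 0.0
    weight_sum = 0.0
    for i, j in combinations(range(len(preds)), 2):
        gap = targets[i] - targets[j]
        if gap == 0:
            continue  # tied human ratings carry no ordering information
        sign = 1.0 if gap > 0 else -1.0
        # Adaptive margin: pairs that humans rate further apart must also be
        # separated more strongly by the predictor (assumed linear in the gap).
        margin = margin_scale * abs(gap)
        # Pair weight: larger rating gaps are treated as more reliable
        # preferences and count more (assumed form).
        weight = 1.0 + weight_scale * abs(gap)
        hinge = max(0.0, margin - sign * (preds[i] - preds[j]))
        total += weight * hinge
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


human_mos = [4.0, 2.0, 3.0]
loss_correct_order = adaptive_margin_ranking_loss([4.0, 2.0, 3.0], human_mos)
loss_reversed_order = adaptive_margin_ranking_loss([2.0, 4.0, 3.0], human_mos)
```

In this sketch a prediction that preserves the human ordering with enough separation incurs no loss, while a reversed ordering is penalized most on the pair with the largest rating gap.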

📝 Abstract

Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
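Alignment between predicted scores and human ratings is commonly quantified with linear (Pearson) and rank (Spearman) correlation. The following is a minimal dependency-free sketch of both; the helper names and the toy scores are hypothetical and are not taken from the paper or the challenge data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): predictions are monotone in the human scores
# but not linear in them.
human = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]
linear_corr = pearson(human, predicted)
rank_corr = spearman(human, predicted)
```

Because Spearman depends only on ordering, a predictor that ranks every sample correctly reaches a rank correlation of 1 even when its scores are not linearly related to the human ratings, which is why the two metrics are usually reported together.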

Problem

Research questions and friction points this paper is trying to address.

Address subjectivity in audio generation evaluation

Improve mean opinion score prediction accuracy

Align assessment with the relative nature of human perceptual judgments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quality-aware Adaptive Margin Ranking Optimization

Integrates regression objectives from different perspectives

Leverages pre-trained audio-text models
