🤖 AI Summary
This study addresses the lack of interpretability and human alignment in audio editing quality assessment by proposing the first natural language-based automated evaluation framework built on multimodal large language models (MLLMs). Methodologically, it integrates a difference-commonality reasoning mechanism with chain-of-thought (CoT) prompting, augmented by dual-task fine-tuning and lightweight instruction tuning, enabling fine-grained understanding of and stepwise reasoning over multiple audio inputs. The framework is the first to generate interpretable, text-based comparative evaluations of audio edits. Experiments show that its outputs substantially outperform existing baselines in both subjective perceptual consistency and correlation with objective metrics, aligning closely with human judgments across diverse audio editing tasks. This work establishes a transparent, accurate, and interpretable paradigm for trustworthy audio evaluation.
📝 Abstract
Automatic mean opinion score (MOS) prediction provides a more perceptual alternative to objective metrics, offering deeper insights into the evaluated models. With the rapid progress of multimodal large language models (MLLMs), their enhanced perceptual and reasoning abilities enable more comprehensive and interpretable audio quality assessment. In this work, we tackle the challenging task of audio editing evaluation and propose the first natural language-based automated evaluation framework built on MLLMs. Our approach introduces two fine-tuning tasks to boost multi-audio understanding, combined with chain-of-thought prompting and lightweight instruction tuning to enhance step-by-step reasoning. Experiments demonstrate that our framework delivers accurate, interpretable, text-based editing evaluations, closely aligning with human judgments and objective metrics while substantially improving over baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.
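To make the difference-commonality chain-of-thought idea concrete, the sketch below assembles an illustrative CoT evaluation prompt for a pair of audio inputs (original and edited). The prompt wording, the `build_cot_prompt` helper, and the `<original_audio>`/`<edited_audio>` placeholders are hypothetical assumptions for illustration, not the paper's actual prompt or API.

```python
# Hypothetical sketch (not the paper's exact prompt): a difference-commonality
# chain-of-thought prompt for MLLM-based audio editing evaluation. The model
# first reasons about what the two clips share, then how they differ, and
# finally judges the edit in natural language.

def build_cot_prompt(instruction: str) -> str:
    """Assemble a step-by-step comparative evaluation prompt for an
    original/edited audio pair, given the editing instruction."""
    steps = [
        "Step 1: Describe what the original and edited audio have in common.",
        "Step 2: Describe how the edited audio differs from the original.",
        "Step 3: Judge whether those differences realize the editing "
        f"instruction: '{instruction}'.",
        "Step 4: Give an overall quality assessment with a brief rationale.",
    ]
    return (
        "You are given two audio clips: <original_audio> and <edited_audio>.\n"
        "Reason step by step before your final judgment.\n"
        + "\n".join(steps)
    )

prompt = build_cot_prompt("add light rain in the background")
print(prompt)
```

The fine-tuned MLLM would receive this text alongside the two audio inputs and return a free-form, interpretable comparative evaluation rather than a bare score.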