🤖 AI Summary
This study addresses the lack of interpretability and human alignment in audio editing quality assessment by proposing the first natural language-based automated evaluation framework built on multimodal large language models (MLLMs). Methodologically, it integrates a difference-commonality reasoning mechanism with chain-of-thought (CoT) prompting, augmented by dual-task fine-tuning and lightweight instruction tuning, enabling fine-grained understanding of and stepwise reasoning over multiple audio inputs. The framework is the first to generate interpretable, text-based comparative evaluations of audio edits. Experiments show that its outputs substantially outperform existing baselines in both subjective perceptual consistency and correlation with objective metrics, aligning closely with human judgments across diverse audio editing tasks. This work establishes a transparent, accurate, and interpretable paradigm for trustworthy audio evaluation.
📝 Abstract
Automatic mean opinion score (MOS) prediction provides a more perceptual alternative to objective metrics, offering deeper insights into the evaluated models. With the rapid progress of multimodal large language models (MLLMs), their enhanced perceptual and reasoning abilities enable more comprehensive and interpretable audio quality assessment. In this work, we tackle the challenging task of audio editing evaluation and propose the first natural language-based automated evaluation framework built on MLLMs. Our approach introduces two fine-tuning tasks to boost multi-audio understanding, combined with chain-of-thought prompting and lightweight instruction tuning to enhance step-by-step reasoning. Experiments demonstrate that our framework delivers accurate, interpretable, text-based editing evaluations, closely aligning with human judgments and objective metrics while substantially improving over baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.
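To make the difference-commonality chain-of-thought idea concrete, the sketch below assembles an illustrative CoT evaluation prompt for a pair of audio inputs (original and edited). The prompt wording, the `build_cot_prompt` helper, and the `<original_audio>`/`<edited_audio>` placeholders are hypothetical assumptions for illustration, not the paper's actual prompt or API.

```python
# Hypothetical sketch (not the paper's exact prompt): a difference-commonality
# chain-of-thought prompt for MLLM-based audio editing evaluation. The model
# first reasons about what the two clips share, then how they differ, and
# finally judges the edit in natural language.

def build_cot_prompt(instruction: str) -> str:
    """Assemble a step-by-step comparative evaluation prompt for an
    original/edited audio pair, given the editing instruction."""
    steps = [
        "Step 1: Describe what the original and edited audio have in common.",
        "Step 2: Describe how the edited audio differs from the original.",
        "Step 3: Judge whether those differences realize the editing "
        f"instruction: '{instruction}'.",
        "Step 4: Give an overall quality assessment with a brief rationale.",
    ]
    return (
        "You are given two audio clips: <original_audio> and <edited_audio>.\n"
        "Reason step by step before your final judgment.\n"
        + "\n".join(steps)
    )

prompt = build_cot_prompt("add light rain in the background")
print(prompt)
```

The fine-tuned MLLM would receive this text alongside the two audio inputs and return a free-form, interpretable comparative evaluation rather than a bare score.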