A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining

📅 2024-07-06
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
In language-queried audio source separation (LASS), objective assessment is hindered by the need for reference audio and by the difficulty of jointly evaluating textual semantics and separation quality. To address this, the paper proposes CLAPScore, a reference-free evaluation metric for LASS built upon contrastive language-audio pretraining (CLAP). CLAPScore computes the cross-modal semantic similarity between the separated audio and the input text query, enabling semantics-driven objective evaluation without ground-truth source signals. Unlike conventional waveform-based metrics such as SDR, CLAPScore takes the content of the text query into account, overcoming the limitation of text-agnostic metrics. Experiments show that CLAPScore provides an effective evaluation of the semantic relevance of the separated audio to the text query, offering an alternative to SDR-based performance evaluation of LASS systems. The evaluation code is publicly available.

📝 Abstract
Language-queried audio source separation (LASS) aims to separate an audio source guided by a text query, with the signal-to-distortion ratio (SDR)-based metrics being commonly used to objectively measure the quality of the separated audio. However, the SDR-based metrics require a reference signal, which is often difficult to obtain in real-world scenarios. In addition, with the SDR-based metrics, the content information of the text query is not considered effectively in LASS. This paper introduces a reference-free evaluation metric using a contrastive language-audio pretraining (CLAP) module, termed CLAPScore, which measures the semantic similarity between the separated audio and the text query. Unlike SDR, the proposed CLAPScore metric evaluates the quality of the separated audio based on the content information of the text query, without needing a reference signal. Experiments show that the CLAPScore provides an effective evaluation of the semantic relevance of the separated audio to the text query, as compared to the SDR metric, offering an alternative for the performance evaluation of LASS systems. The code for evaluation is publicly available.
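The metric described above can be sketched as follows: embed the separated audio and the text query with a pretrained CLAP model, then take the cosine similarity of the two embeddings. This is a minimal illustration, not the authors' implementation; the embeddings here are random stand-ins, and in practice they would come from a pretrained CLAP model (e.g. the `laion_clap` package, whose embedding-extraction function names are not assumed here).

```python
import numpy as np

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between a CLAP audio embedding and a CLAP
    text embedding, as in the CLAPScore metric described above.

    Both inputs are 1-D embedding vectors from the same CLAP model;
    here they are just NumPy arrays (illustrative sketch only).
    """
    a = audio_emb / np.linalg.norm(audio_emb)  # L2-normalize audio embedding
    t = text_emb / np.linalg.norm(text_emb)    # L2-normalize text embedding
    return float(np.dot(a, t))                 # cosine similarity in [-1, 1]

# Toy example with random stand-in vectors (NOT real CLAP outputs):
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(512)  # hypothetical separated-audio embedding
text_emb = rng.standard_normal(512)   # hypothetical text-query embedding
score = clap_score(audio_emb, text_emb)  # higher = closer semantics
```

A higher score indicates that the separated audio is semantically closer to the text query; no reference source signal is involved at any point, which is what makes the metric reference-free.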
Problem

Research questions and friction points this paper is trying to address.

Audio Separation
Speech Recognition
Text-audio Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLAPScore
Audio-Text Matching
LASS Evaluation
Feiyang Xiao
Group of Intelligent Signal Processing (GISP), Harbin Engineering University
Detection and Classification of Acoustic Scenes and Events; Audio-Text Multi-Modality Learning
Jian Guan
College of Computer Science and Technology, Harbin Engineering University, Harbin, China
Qiaoxi Zhu
University of Technology Sydney, Ultimo, Australia
Xubo Liu
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Wenbo Wang
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Shuhan Qi
School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Kejia Zhang
College of Computer Science and Technology, Harbin Engineering University, Harbin, China
Jianyuan Sun
Department of Computer Science, University of Sheffield, Sheffield, UK
Wenwu Wang
Professor, University of Surrey, UK
signal processing; machine learning; machine listening; audio/speech/audio-visual; multimodal fusion