🤖 AI Summary
This paper addresses the scarcity of labeled data for modeling musical and vocal similarity. The authors propose a contrastive self-supervised learning framework for music representation learning. The key innovation is a dual-path contrastive strategy: a label-aware path that constructs positive/negative pairs using artist identity, and a label-agnostic path that enables fully unsupervised pretraining via controllable synthesis, mixing real vocals with synthetic accompaniments. The model employs an audio-level contrastive learning architecture and is evaluated via linear probing and user studies. Experiments demonstrate substantial improvements over state-of-the-art baselines on downstream tasks, including artist-specific vocal identification and cross-style vocal matching, as well as perceptual similarity assessment. Notably, the label-agnostic variant achieves performance comparable to its label-aware counterpart without any annotations, significantly enhancing generalizability and practical applicability.
📝 Abstract
The availability of large, unlabeled datasets across various domains has contributed to the development of a plethora of methods that learn representations for multiple target (downstream) tasks through self-supervised pre-training. In this work, we introduce CVSM (Contrastive Vocal Similarity Modeling), a contrastive self-supervised procedure for music signal representation learning in the audio domain that can be used for musical and vocal similarity modeling. Our method operates under a contrastive framework, maximizing the similarity between vocal excerpts and musical mixtures containing the same vocals; we devise both a label-informed protocol, which leverages artist identity information to sample the contrastive pairs, and a label-agnostic scheme, which creates artificial mixtures from randomly sampled vocal and accompaniment excerpts and pairs them with vocals from the same audio segment. We evaluate our proposed method in measuring vocal similarity both objectively, through linear probing on a suite of appropriate downstream tasks, and subjectively, by conducting a user study consisting of pairwise comparisons between different models in a recommendation-by-query setting. Our results indicate that the representations learned through CVSM are effective in musical and vocal similarity modeling, outperforming numerous baselines across both isolated vocals and complete musical mixtures. Moreover, while the availability of artist identity labels during pre-training leads to overall more consistent performance in both the evaluated downstream tasks and the user study, a label-agnostic CVSM variant incorporating hybrid pre-training with real and artificial mixtures achieves performance comparable to the label-informed one in artist identification and perceived vocal similarity.
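To make the two sampling ideas concrete, the sketch below illustrates (a) the label-agnostic construction of an artificial mixture from an independent vocal and accompaniment pair, and (b) a batch-wise InfoNCE-style contrastive loss that pulls each vocal embedding toward the embedding of the mixture containing it, treating the other mixtures in the batch as negatives. This is a minimal illustration, not the paper's implementation: function names, the gain/SNR handling, and the exact loss form (temperature-scaled cosine similarities) are assumptions for demonstration.

```python
import numpy as np

def make_artificial_mixture(vocal, accompaniment, vocal_gain_db=0.0):
    # Label-agnostic pairing (assumed form): mix a vocal excerpt with a
    # randomly sampled accompaniment; the isolated vocal from the same
    # segment then serves as the positive for the resulting mixture.
    gain = 10.0 ** (vocal_gain_db / 20.0)
    return gain * vocal + accompaniment

def info_nce_loss(vocal_emb, mix_emb, temperature=0.1):
    # Contrastive objective (InfoNCE-style sketch): the i-th vocal
    # embedding should be most similar to the i-th mixture embedding;
    # the remaining mixtures in the batch act as negatives.
    v = vocal_emb / np.linalg.norm(vocal_emb, axis=1, keepdims=True)
    m = mix_emb / np.linalg.norm(mix_emb, axis=1, keepdims=True)
    logits = (v @ m.T) / temperature              # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy on matches
```

In a training loop, the embeddings would come from a shared audio encoder applied to the vocal excerpts and the (real or artificial) mixtures; aligned vocal/mixture pairs should yield a lower loss than mismatched ones.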