🤖 AI Summary
Existing music source separation models (e.g., Spleeter, Hybrid Demucs) are trained predominantly on Western commercial music and degrade markedly on non-Western traditions such as Carnatic music, where the available multi-track recordings suffer from severe inter-instrument bleed and domain-appropriate training data is lacking.
Method: We introduce the first high-quality multimodal dataset designed specifically for Carnatic music, comprising low-bleed multi-track audio recordings synchronized with high-definition performance videos. Building on this, we propose a domain-adaptive fine-tuning strategy based on the Spleeter architecture, evaluated with both objective signal-to-distortion ratio (SDR) metrics and a subjective listening study.
Contribution/Results: The model fine-tuned on our dataset achieves substantial SDR improvements over fine-tuning on a pre-existing Carnatic multi-track dataset, and a listening study indicates perceptible quality gains. This work contributes a benchmark dataset, a tailored adaptation methodology, and an evaluation framework for source separation in traditional music.
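As an illustration of the adaptation workflow, the sketch below shows how a fine-tuned Spleeter checkpoint might be loaded for inference through Spleeter's Python API. The configuration path, checkpoint location, and file names are hypothetical placeholders, not the paper's released artifacts.

```python
from spleeter.separator import Separator

# Hypothetical JSON configuration describing a 4-stem model fine-tuned on
# Carnatic multi-track data; Spleeter accepts either a built-in descriptor
# such as "spleeter:4stems" or a path to a custom config file.
separator = Separator("configs/carnatic_4stems.json")

# Separate a mixture into per-stem WAV files under the output directory.
separator.separate_to_file("audio/concert_excerpt.wav", "output/")
```

Fine-tuning itself is typically driven by Spleeter's `spleeter train` command pointed at such a config and a dataset index, with the details depending on the chosen configuration.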
📝 Abstract
Music source separation demixes a piece of music into its individual sound sources (vocals, percussion, melodic instruments, etc.), a task with no simple mathematical solution; it therefore relies on deep learning methods trained on large datasets of isolated music stems. The most commonly available datasets are built from commercial Western music, limiting the models' applicability to non-Western genres such as Carnatic music. Carnatic music is a live tradition, and the available multi-track recordings contain overlapping sounds and bleed between the sources. This poses a challenge to widely used source separation models like Spleeter and Hybrid Demucs. In this work, we introduce 'Sanidha', the first open-source dataset for Carnatic music offering studio-quality, multi-track recordings with minimal to no overlap or bleed. Along with the audio files, we provide high-definition videos of the artists' performances. Additionally, we fine-tuned Spleeter, one of the most commonly used source separation models, on our dataset and observed improved SDR performance compared to fine-tuning on a pre-existing Carnatic multi-track dataset. The outputs of the model fine-tuned on 'Sanidha' are further evaluated through a listening study.
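The SDR comparison described above can be reproduced with standard BSS-Eval tooling. The sketch below uses the mir_eval library on hypothetical reference and estimated stems; the stem names reflect a typical Carnatic ensemble rather than necessarily the paper's exact configuration, and the file paths are placeholders.

```python
import numpy as np
import soundfile as sf
import mir_eval

# Illustrative stem layout for a Carnatic ensemble; paths are placeholders
# for aligned, equal-length mono reference stems and the corresponding
# estimates produced by the fine-tuned separation model.
stems = ["vocals", "violin", "mridangam", "other"]
reference = np.stack([sf.read(f"reference/{name}.wav")[0] for name in stems])
estimates = np.stack([sf.read(f"estimates/{name}.wav")[0] for name in stems])

# BSS-Eval metrics per source; SDR is the figure of merit reported above.
sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(reference, estimates)
for name, value in zip(stems, sdr):
    print(f"{name}: {value:.2f} dB SDR")
```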