Myna: Masking-Based Contrastive Learning of Musical Representations

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inefficiency, weak pitch sensitivity, and suboptimal downstream performance in music representation learning, this paper proposes Myna, a contrastive learning framework built on mask-based augmentation. Methodologically, Myna applies a Vision Transformer (ViT) to mel-spectrograms and introduces token masking as the data augmentation: masking 90% of spectrogram tokens sharply reduces compute per example and accelerates training. It further uses vertical patches to improve tonal perception and a hybrid patch-embedding scheme (16×16 and 128×2) that jointly captures local acoustic detail and frequency-spanning structure. Experiments show that Myna achieves state-of-the-art results, with notable gains on key detection, while training on a single GPU. With a significantly smaller parameter count than baseline models, it outperforms the 62M-parameter MULE on average and matches MERT-95M, which was trained on 16 and 64 GPUs, thereby setting a new efficiency–performance trade-off benchmark in self-supervised music representation learning.
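The 90% token-masking augmentation described above can be illustrated with a minimal sketch. The function name, shapes, and the tie-breaking detail (keeping at least one token) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def mask_tokens(tokens, mask_ratio=0.9, rng=None):
    """Randomly keep (1 - mask_ratio) of the spectrogram tokens.

    tokens: array of shape (num_tokens, dim), e.g. flattened ViT patch embeddings.
    Returns the surviving tokens and their (sorted) original indices.
    Illustrative sketch; the paper's implementation may differ.
    """
    rng = rng or np.random.default_rng(0)
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1 - mask_ratio))))  # assumption: keep >= 1 token
    keep_idx = rng.choice(n, size=n_keep, replace=False)
    keep_idx = np.sort(keep_idx)
    return tokens[keep_idx], keep_idx

# toy mel-spectrogram tokenized into 128 patch embeddings of dimension 32
toks = np.random.default_rng(1).normal(size=(128, 32))
kept, idx = mask_tokens(toks, mask_ratio=0.9)
```

Because only ~10% of tokens pass through the encoder, the per-GPU memory cost per example drops roughly tenfold, which is what enables the batch-size jump from 48–120 to 4096 reported above.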

📝 Abstract
We present Myna, a simple yet effective approach for self-supervised musical representation learning. Built on a contrastive learning framework, Myna introduces two key innovations: (1) the use of a Vision Transformer (ViT) on mel-spectrograms as the backbone and (2) a novel data augmentation strategy, token masking, that masks 90 percent of spectrogram tokens. These innovations deliver both effectiveness and efficiency: (i) Token masking enables a significant increase in per-GPU batch size, from 48 or 120 in prior methods (CLMR, MULE) to 4096. (ii) By avoiding traditional augmentations, Myna retains pitch sensitivity, enhancing performance in tasks like key detection. (iii) The use of vertical patches allows the model to better capture critical features for key detection. Our hybrid model, Myna-22M-Hybrid, processes both 16x16 and 128x2 patches, achieving state-of-the-art results. Trained on a single GPU, it outperforms MULE (62M) on average and rivals MERT-95M, which was trained on 16 and 64 GPUs, respectively. Additionally, it surpasses MERT-95M-public, establishing itself as the best-performing model trained on publicly available data. We release our code and models to promote reproducibility and facilitate future research.
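The two patch geometries mentioned in the abstract (16×16 square patches and 128×2 vertical patches spanning all mel bins) can be sketched as a simple non-overlapping patch extraction. The helper below is a hypothetical illustration of the input the ViT patch-embedding layer would receive, not the paper's code:

```python
import numpy as np

def patchify(spec, ph, pw):
    """Split a (mel_bins, frames) spectrogram into non-overlapping (ph, pw)
    patches, each flattened to a vector, as fed to a ViT patch embedding.
    Trailing rows/columns that don't fill a patch are cropped (an assumption)."""
    m, t = spec.shape
    m2, t2 = (m // ph) * ph, (t // pw) * pw
    spec = spec[:m2, :t2]
    patches = spec.reshape(m2 // ph, ph, t2 // pw, pw).swapaxes(1, 2)
    return patches.reshape(-1, ph * pw)

spec = np.zeros((128, 96))        # 128 mel bins, 96 time frames (toy sizes)
square = patchify(spec, 16, 16)   # local time-frequency detail
vertical = patchify(spec, 128, 2) # each patch spans the full frequency axis
```

Note that a 128×2 patch covers every mel bin at once, so each token sees the full harmonic stack of a short time slice, which is why vertical patches help key detection.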
Problem

Research questions and friction points this paper is trying to address.

Contrastive music pretraining is compute-inefficient: small per-GPU batch sizes (48–120 in CLMR and MULE) force multi-GPU training
Traditional audio augmentations discard pitch information, weakening performance on tasks like key detection
Competitive self-supervised music representations (e.g. MERT-95M) require 16–64 GPUs to train
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer backbone operating directly on mel-spectrograms
Token masking (90% of spectrogram tokens) as the sole data augmentation, enabling per-GPU batch sizes of 4096
Hybrid model (Myna-22M-Hybrid) combining 16×16 square patches with 128×2 vertical patches
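The contrastive framework these contributions plug into can be sketched with a SimCLR-style NT-Xent loss over two masked views of each clip. The paper states only that Myna is contrastive; the specific loss below, its temperature, and all names are assumptions for illustration:

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent (normalized temperature-scaled cross-entropy) loss.

    z1, z2: (batch, dim) embeddings of two views of the same clips
    (e.g. two independent 90% token maskings). Illustrative sketch only.
    """
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity space
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = z1.shape[0]
    # row i's positive is the other view of the same clip
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    pos = sim[np.arange(2 * n), targets]
    return float(np.mean(logsumexp - pos))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
z2 = z1 + 0.05 * rng.normal(size=(4, 8))  # slightly perturbed second view
loss = nt_xent(z1, z2)
```

Because the loss contrasts each clip against all others in the batch, larger batches give more negatives per step, which is why the 4096 batch size enabled by token masking matters for quality as well as speed.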