🤖 AI Summary
Discrete representation learning for symbolic music must balance generation fidelity against semantic interpretability. Method: MuseTok is a tokenization method that applies a residual quantized variational autoencoder (RQ-VAE) to bar-wise music segments within a Transformer-based encoder-decoder framework, producing discrete codes that support high-fidelity reconstruction and accurate modeling of music theory. Contribution/Results: Models incorporating MuseTok outperform previous representation learning baselines on semantic understanding tasks, including melody extraction, chord recognition, and emotion recognition, while matching them in music generation. Qualitative analyses of MuseTok codes, using ground-truth categories and synthetic datasets, indicate that the learned tokens capture underlying musical concepts from large music collections.
📝 Abstract
Discrete representation learning has shown promising results across various domains, including generation and understanding in image, speech, and language. Inspired by these advances, we propose MuseTok, a tokenization method for symbolic music, and investigate its effectiveness in both music generation and understanding tasks. MuseTok employs a residual quantized variational autoencoder (RQ-VAE) on bar-wise music segments within a Transformer-based encoder-decoder framework, producing music codes that achieve high-fidelity music reconstruction and accurate understanding of music theory. For comprehensive evaluation, we apply MuseTok to music generation and semantic understanding tasks, including melody extraction, chord recognition, and emotion recognition. Models incorporating MuseTok outperform previous representation learning baselines in semantic understanding while maintaining comparable performance in content generation. Furthermore, qualitative analyses of MuseTok codes, using ground-truth categories and synthetic datasets, reveal that MuseTok effectively captures underlying musical concepts from large music collections.
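To make the residual quantization idea concrete: an RQ-VAE encodes each bar into a latent vector, then quantizes it with a *stack* of codebooks, where each level snaps the remaining residual to its nearest codeword. The sketch below shows only this quantization step, not the paper's model; the codebook sizes, depth, and latent dimension are illustrative assumptions, and in MuseTok the latents come from a learned Transformer encoder rather than random data.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Residual quantization: at each level, pick the codeword nearest
    to the current residual, record its index, and subtract it. The sum
    of the chosen codewords is the reconstruction of z; the sequence of
    indices is the discrete token for this latent."""
    residual = np.asarray(z, dtype=float).copy()
    indices = []
    recon = np.zeros_like(residual)
    for cb in codebooks:  # cb has shape (K, D): K codewords of dim D
        k = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(k)
        recon += cb[k]
        residual -= cb[k]
    return indices, recon

# Toy example: a depth-2 quantizer over 2-D latents (sizes are arbitrary).
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 2)) for _ in range(2)]
z = np.array([0.5, -1.0])
idx, z_hat = residual_quantize(z, codebooks)
```

Each bar is thus represented by a short tuple of codebook indices (`idx` above), which downstream Transformers can consume as ordinary discrete tokens for generation or understanding tasks.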