🤖 AI Summary
To address the challenges of modal feature modeling and emotional expression in Chinese traditional music generation, this paper proposes a dual-feature modeling paradigm that couples Mamba’s long-range dependency modeling with the Transformer’s global structural awareness. We design a bidirectional Mamba fusion layer and introduce REMI-M, a modality-enhanced symbolic representation tailored to Chinese tonal systems. We also present FolkDB, the first high-quality Chinese folk music dataset (11.3 hours), curated to support culturally grounded model training. The architecture combines Mamba blocks, Transformer blocks, and a bidirectional scanning mechanism within a self-supervised sequence modeling framework. Experiments demonstrate substantial improvements in modal recognition accuracy (+12.7% absolute) and melodic cultural fidelity, achieving state-of-the-art performance across diverse Chinese traditional music generation tasks.
📝 Abstract
In recent years, deep learning has significantly advanced symbolic (MIDI) music modeling, solidifying music generation as a key application of artificial intelligence. However, existing research focuses primarily on Western music and struggles to generate melodies for Chinese traditional music, especially in capturing modal characteristics and emotional expression. To address these issues, we propose a new architecture, the Dual-Feature Modeling Module, which integrates the long-range dependency modeling of the Mamba Block with the global structure-capturing capability of the Transformer Block. Additionally, we introduce the Bidirectional Mamba Fusion Layer, which integrates local details and global structures through bidirectional scanning, enhancing the modeling of complex sequences. Building on this architecture, we propose the REMI-M representation, which more accurately captures and generates modal information in melodies. To support this research, we construct FolkDB, a high-quality Chinese traditional music dataset encompassing various styles and totaling over 11 hours of music. Experimental results demonstrate that the proposed architecture excels at generating melodies with Chinese traditional music characteristics, offering a new and effective solution for music generation.
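The abstract does not spell out the REMI-M vocabulary, but since REMI-style encodings conventionally use Bar, Position, Pitch, and Duration events, a modality-enhanced variant can be pictured as that event stream plus an explicit mode token. The sketch below is purely illustrative: the `Mode_*` token, the per-bar placement of it, and the five-mode list are assumptions for demonstration, not the published REMI-M specification.

```python
# Hypothetical sketch of a REMI-M-style token stream. The Mode token and
# its per-bar placement are illustrative assumptions, not the paper's design.
PENTATONIC_MODES = ["Gong", "Shang", "Jue", "Zhi", "Yu"]  # five classical Chinese modes

def encode_remi_m(notes, mode, positions_per_bar=16):
    """Encode (bar, position, midi_pitch, duration) tuples into REMI-M-style tokens.

    Each new bar emits a Bar token followed by a Mode token, so modal
    information is available to the model throughout the sequence.
    """
    assert mode in PENTATONIC_MODES
    tokens, current_bar = [], None
    for bar, pos, pitch, dur in sorted(notes):
        if bar != current_bar:
            tokens += [f"Bar_{bar}", f"Mode_{mode}"]
            current_bar = bar
        tokens += [f"Position_{pos}/{positions_per_bar}", f"Pitch_{pitch}", f"Duration_{dur}"]
    return tokens

# A two-bar fragment in the (assumed) Gong mode:
notes = [(0, 0, 60, 4), (0, 4, 62, 4), (0, 8, 64, 8), (1, 0, 67, 8)]
print(encode_remi_m(notes, "Gong"))
```

The design intuition, consistent with the abstract, is that interleaving modal tokens with note events lets a sequence model condition every prediction on the mode rather than inferring it implicitly from pitch statistics.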
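The bidirectional scanning idea can be sketched with a toy linear recurrence: scan the sequence left-to-right, scan its reversal, and fuse the two passes so every position sees both past and future context. The actual Bidirectional Mamba Fusion Layer uses learned selective state-space blocks; the fixed decay `a` and the additive fusion below are illustrative assumptions, not the published design.

```python
import numpy as np

def causal_scan(x, a=0.9):
    """Left-to-right toy linear recurrence h_t = a * h_{t-1} + x_t over x: (T, d)."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a * h + x[t]
        out[t] = h
    return out

def bidirectional_fusion(x, a=0.9):
    """Fuse a forward scan with a backward scan (illustrative additive fusion)."""
    fwd = causal_scan(x, a)
    bwd = causal_scan(x[::-1], a)[::-1]  # scan the reversed sequence, then flip back
    return fwd + bwd

# On a constant input the fused output is symmetric about the sequence midpoint,
# showing that each position aggregates context from both directions.
y = bidirectional_fusion(np.ones((5, 2)))
```

A unidirectional scan alone would leave early positions with no access to later structure; the backward pass is what lets local detail and global shape be combined, matching the role the abstract assigns to bidirectional scanning.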