Training chord recognition models on artificially generated audio

📅 2025-08-07

🤖 AI Summary

Music information retrieval (MIR) is constrained by the scarcity of copyright-free audio recordings for model training and evaluation. To address this, the study examines whether Transformer-based chord recognition models can be trained on artificially generated multitrack audio. Two Transformer architectures, ChordBERT and ChordViT, are trained on various combinations of the synthetic Artificial Audio Multitracks (AAM) dataset and two real-world corpora, Schubert's *Winterreise* dataset and the McGill Billboard dataset, and are evaluated with three metrics: Root, MajMin, and the Chord Content Metric (CCM). The results indicate that synthetic data can enrich a smaller training set of human-composed music and can serve as a standalone training set for pop music chord estimation when no other data is available. The work thus supports synthetic audio as a practical training resource for low-resource MIR tasks, while acknowledging that artificially generated and human-composed music differ in complexity and structure.

📝 Abstract

One of the challenging problems in Music Information Retrieval is the acquisition of sufficient non-copyrighted audio recordings for model training and evaluation. This study compares two Transformer-based neural network models for chord sequence recognition in audio recordings and examines the effectiveness of using an artificially generated dataset for this purpose. The models are trained on various combinations of Artificial Audio Multitracks (AAM), Schubert's Winterreise Dataset, and the McGill Billboard Dataset and evaluated with three metrics: Root, MajMin, and Chord Content Metric (CCM). The experiments show that, although artificially generated and human-composed music differ in complexity and structure, the former can be useful in certain scenarios. Specifically, AAM can enrich a smaller training dataset of human-composed music, or can even serve as a standalone training set for a model that predicts chord sequences in pop music if no other data is available.
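To make "various combinations" concrete, the sketch below enumerates every non-empty subset of the three datasets as a candidate training configuration. This is an illustration only: the abstract does not say which combinations were actually used, and the names `DATASETS` and `training_configurations` are hypothetical.

```python
from itertools import combinations

# Dataset names taken from the abstract; the tuple itself is a hypothetical helper.
DATASETS = ("AAM", "Winterreise", "Billboard")


def training_configurations(datasets=DATASETS):
    """Return every non-empty subset of the datasets, smallest subsets first.

    Assumption: the paper may have used only some of these combinations.
    """
    return [
        combo
        for size in range(1, len(datasets) + 1)
        for combo in combinations(datasets, size)
    ]


if __name__ == "__main__":
    for config in training_configurations():
        print(" + ".join(config))
```

With three datasets this yields seven configurations, ranging from AAM alone (the standalone synthetic case) to all three corpora combined.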

Problem

Research questions and friction points this paper is trying to address.

Addressing shortage of non-copyrighted audio for chord recognition training

Evaluating Transformer models on artificial vs human-composed music datasets

Assessing artificial audio's utility for pop music chord prediction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Transformer-based models for chord recognition

Trains on artificially generated audio datasets

Evaluates with Root, MajMin, and CCM metrics
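The Root and MajMin metrics can be illustrated with a simplified frame-wise sketch. It assumes equal-length frames and chord labels in a `root:quality` form such as `C:maj` or `A:min7`, with `N` for "no chord"; the function names are hypothetical, and the paper's actual evaluation code may differ (for example, by weighting segments by duration). CCM is omitted because its definition is not given in this summary.

```python
# Simplified frame-wise Root and MajMin scores (a sketch, not the paper's code).
NATURAL_PITCH_CLASSES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def parse_root(label):
    """Return the root pitch class (0-11), or None for the no-chord label 'N'."""
    if label == "N":
        return None
    # Drop any bass annotation ("/3") and the quality (":maj").
    root = label.split("/")[0].split(":")[0]
    pitch_class = NATURAL_PITCH_CLASSES[root[0]]
    for accidental in root[1:]:
        if accidental == "#":
            pitch_class += 1
        elif accidental == "b":
            pitch_class -= 1
        else:
            raise ValueError(f"unexpected root symbol in {label!r}")
    return pitch_class % 12


def reduce_majmin(label):
    """Map a label to (root, 'maj'|'min'), ('N',), or None if outside maj/min."""
    if label == "N":
        return ("N",)
    quality = label.split("/")[0].partition(":")[2] or "maj"  # bare "C" means C major
    if quality.startswith("maj"):
        return (parse_root(label), "maj")
    if quality.startswith("min"):
        return (parse_root(label), "min")
    return None  # e.g. dim, aug, sus: not scored by this simplified MajMin


def root_accuracy(reference, estimate):
    """Fraction of frames whose root pitch class matches."""
    hits = sum(parse_root(r) == parse_root(e) for r, e in zip(reference, estimate))
    return hits / len(reference)


def majmin_accuracy(reference, estimate):
    """Fraction of scorable reference frames whose root and maj/min type match."""
    pairs = [(reduce_majmin(r), reduce_majmin(e)) for r, e in zip(reference, estimate)]
    scored = [(r, e) for r, e in pairs if r is not None]
    if not scored:
        raise ValueError("no reference frames within the maj/min vocabulary")
    return sum(r == e for r, e in scored) / len(scored)


if __name__ == "__main__":
    ref = ["C:maj", "C:maj", "A:min", "G:maj", "N"]
    est = ["C:maj", "C:min", "A:min7", "G:maj", "N"]
    print(root_accuracy(ref, est), majmin_accuracy(ref, est))
```

In this toy example every root matches, while one frame confuses C major with C minor, so the Root score is higher than the MajMin score. This is the usual relationship between the two metrics, since MajMin imposes a stricter match.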
