🤖 AI Summary
Symbolic music research has long relied on MIDI data, largely overlooking the rich structural information embedded in human-readable score formats such as LilyPond. To bridge this gap, we introduce and publicly release BMdataset—a high-quality, musicologically informed dataset comprising 393 expert-transcribed LilyPond scores derived from Baroque manuscripts. Building upon this resource, we propose LilyBERT, an extension of CodeBERT augmented with 115 LilyPond-specific tokens and pretrained via masked language modeling. Under linear-probing evaluation, LilyBERT fine-tuned solely on BMdataset outperforms models continuously pretrained on 15 billion tokens of general-purpose music data. Furthermore, combining general pretraining with domain-specific fine-tuning yields 84.3% accuracy on composer classification, demonstrating the remarkable efficacy of small-scale, expert-annotated datasets for music understanding tasks.
📝 Abstract
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language modeling pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite BMdataset's modest size (~90M tokens), fine-tuning on it alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.
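The abstract's vocabulary-extension step can be illustrated with a minimal sketch. The actual work extends CodeBERT's subword tokenizer (in HuggingFace `transformers`, this would be `tokenizer.add_tokens([...])` followed by `model.resize_token_embeddings(len(tokenizer))`); the toy greedy tokenizer and vocabularies below are purely illustrative and not the paper's implementation — they only show why adding whole LilyPond commands as tokens changes the segmentation:

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary.
    Spans not in the vocabulary fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        # Try the longest substring starting at i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        tokens.append(match if match else text[i])
        i += len(match) if match else 1
    return tokens

# Illustrative base vocabulary: LilyPond commands get fragmented.
base_vocab = {"rel", "ative", "c", "'", " ", "\\"}
# Vocabulary extension: add domain-specific tokens as atomic units.
extended_vocab = base_vocab | {"\\relative", "\\clef", "\\time"}

snippet = "\\relative c'"
print(tokenize(snippet, base_vocab))      # → ['\\', 'rel', 'ative', ' ', 'c', "'"]
print(tokenize(snippet, extended_vocab))  # → ['\\relative', ' ', 'c', "'"]
```

Keeping commands like `\relative` atomic means the encoder sees one embedding per musical construct instead of arbitrary subword fragments, which is the motivation for the 115 added tokens.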