Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

📅 2026-06-07

🤖 AI Summary

Evaluation of symbolic music in large language models is fragmented: there are no standardized representations, datasets, or metrics. To address this gap, this work proposes LilyBench, a LilyPond-based benchmark that evaluates music generation and music understanding within a single symbolic format. Generation is scored along several dimensions: compile rate, MusPy descriptor distributions compared via Jensen-Shannon similarity, and Fréchet Music Distance computed on LilyBERT embeddings. Experiments on four open-weight models show that compilable LilyPond scores can be generated in zero-shot settings and that the models perform strongly on composer and genre recognition, while structural understanding tasks remain challenging. The study also reveals systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation should triangulate across metrics rather than rank models by a single score.
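To make the embedding-based metric concrete, the sketch below computes a Fréchet distance between two sets of embedding vectors. It is an illustration only, not the benchmark's implementation: it assumes each set is modeled as a Gaussian with a diagonal covariance (the general formula uses full covariance matrices and a matrix square root), it uses population variance, and the function names and toy vectors are hypothetical stand-ins for LilyBERT embeddings.

```python
import math
from statistics import fmean, pvariance


def fit_diagonal_gaussian(embeddings):
    """Return per-dimension means and population variances of a list of vectors."""
    dims = list(zip(*embeddings))  # transpose: one tuple per embedding dimension
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means, variances


def frechet_distance_diagonal(reference, generated):
    """Fréchet distance between two Gaussians, ASSUMING diagonal covariances.

    With diagonal covariances the general expression
        ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    reduces to a sum over dimensions, so no matrix square root is needed.
    """
    mu_r, var_r = fit_diagonal_gaussian(reference)
    mu_g, var_g = fit_diagonal_gaussian(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


# Toy 2-dimensional "embeddings" (hypothetical values, not real model output).
reference_set = [[0.0, 0.0], [2.0, 2.0]]
generated_set = [[1.0, 0.0], [3.0, 2.0]]
distance = frechet_distance_diagonal(reference_set, generated_set)
```

Lower values mean the generated set sits closer to the reference set in embedding space. A full-covariance version would capture correlations between dimensions that this simplification ignores.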

📝 Abstract

Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench
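The descriptor-based comparison mentioned in the abstract can be sketched as follows. This is a minimal example under stated assumptions, not the released evaluation code: it assumes a descriptor is summarized as a histogram over the same bins for the reference and generated sets, and it defines similarity as one minus the Jensen-Shannon divergence in base 2 (which is bounded in [0, 1]). The paper's exact definition may differ, and the function names and toy histograms are hypothetical.

```python
import math


def normalize(hist):
    """Turn non-negative bin counts into a probability distribution."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]


def kl_divergence(p, q):
    """KL(p || q) in bits; bins where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(hist_a, hist_b):
    """ASSUMED definition: 1 - Jensen-Shannon divergence (base 2)."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must share the same bins")
    p, q = normalize(hist_a), normalize(hist_b)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - jsd


# Toy pitch-class-style histograms (hypothetical counts).
identical = js_similarity([1, 2, 3], [1, 2, 3])
disjoint = js_similarity([1, 0], [0, 1])
```

Under this definition, identical distributions score 1 and distributions with no overlapping bins score 0. Because each descriptor yields its own score, a model can rank well on one descriptor and poorly on another, which is one reason to read such scores alongside an embedding-based metric.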

Problem

Research questions and friction points this paper is trying to address.

symbolic music

large language models

evaluation benchmark

music generation

music understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

LilyBench

symbolic music generation

music understanding