Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning

📅 2025-09-04
🤖 AI Summary
Problem: A lack of standardized benchmarks and training data hinders systematic evaluation and improvement of large language models (LLMs) and multimodal large language models (MLLMs) on sheet music understanding. Method: We introduce a music-theory-guided, rule-based synthesis method that generates verifiable sheet music reasoning problems in both textual and visual modalities, yielding the benchmark SSMR-Bench and an associated training set. The synthetic data is then used for reinforcement learning with verifiable rewards (RLVR). Contribution/Results: After RLVR training, Qwen3-8B-Base and Qwen2.5-VL-Instruct both improve on SSMR-Bench. The trained Qwen3-8B-Base also surpasses GPT-4 in overall performance on MusicTheoryBench and reaches reasoning performance comparable to GPT-4 prompted with role-play and chain-of-thought strategies. Its performance on math problems improves relative to the original Qwen3-8B-Base, and the enhanced reasoning ability also helps with music composition.

📝 Abstract
Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. To address this, we propose synthesizing sheet music problems grounded in music theory, which can serve both as evaluation benchmarks and as training data for reinforcement learning with verifiable rewards (RLVR). We introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench show the importance of models' reasoning abilities in interpreting sheet music. At the same time, the poor performance of Gemini 2.5-Pro highlights the challenges that MLLMs still face in interpreting sheet music in a visual format. By leveraging synthetic data for RLVR, Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on SSMR-Bench. In addition, the trained Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and achieves reasoning performance comparable to GPT-4 prompted with role-play and chain-of-thought strategies. Notably, its performance on math problems also improves relative to the original Qwen3-8B-Base. Furthermore, our results show that the enhanced reasoning ability can also facilitate music composition. In conclusion, we are the first to propose synthesizing sheet music problems based on music theory rules, and we demonstrate its effectiveness not only in advancing model reasoning for sheet music understanding but also in unlocking new possibilities for AI-assisted music creation.
Problem

Research questions and friction points this paper is trying to address.

Addressing lack of evaluation benchmarks for sheet music reasoning
Synthesizing verifiable sheet music problems using music theory
Enhancing multimodal models' sheet music interpretation through synthetic data
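
To make "synthesizing verifiable problems from music theory" concrete, here is a minimal sketch of one rule-based question type. It is an illustration only and not the paper's actual framework: the interval question, the function names `pitch_number` and `make_interval_question`, and the output dictionary layout are all assumptions. The point is that the answer is derived from a music-theory rule, so it can be checked automatically.

```python
import random

# Semitone offsets of the natural notes within one octave (standard music theory).
NOTE_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def pitch_number(name: str, octave: int) -> int:
    """Map a natural note and octave to a MIDI-style pitch number (C4 = 60)."""
    return 12 * (octave + 1) + NOTE_TO_SEMITONE[name]


def make_interval_question(rng: random.Random) -> dict:
    """Hypothetical generator: sample two notes and derive the answer by rule."""
    names = sorted(NOTE_TO_SEMITONE)
    low = (rng.choice(names), 4)   # lower note, always in octave 4
    high = (rng.choice(names), 5)  # upper note, always in octave 5
    answer = pitch_number(*high) - pitch_number(*low)
    question = (
        f"How many semitones is it from {low[0]}{low[1]} "
        f"up to {high[0]}{high[1]}?"
    )
    return {"question": question, "answer": answer}


if __name__ == "__main__":
    print(make_interval_question(random.Random(0)))
```

Because the upper note is always one octave register higher, every generated interval is positive, and the ground-truth answer needs no human annotation.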
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesizing verifiable sheet music problems using music theory
Creating multimodal evaluation benchmarks and training datasets
Using synthetic data for reinforcement learning with verifiable rewards
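
As a rough illustration of what a verifiable reward can look like, the sketch below scores a model response by exact match against a synthesized ground-truth answer. The `Answer: ...` response format and the function name `verifiable_reward` are assumptions made for this example; the source does not specify the reward implementation.

```python
import re


def verifiable_reward(response: str, gold: str) -> float:
    """Return 1.0 if the last 'Answer: ...' line matches the gold answer, else 0.0."""
    # Assumed convention: the model ends its reasoning with a line "Answer: <value>".
    matches = re.findall(r"Answer:\s*(.+)", response)
    if not matches:
        return 0.0  # no parseable final answer, so no reward
    predicted = matches[-1].strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


if __name__ == "__main__":
    reply = "G4 is seven semitones above C4.\nAnswer: 7"
    print(verifiable_reward(reply, "7"))
```

A binary, rule-checked signal like this is what lets synthesized problems double as reinforcement learning data without a learned reward model.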