Evaluating Multimodal Large Language Models on Core Music Perception Tasks

📅 2025-10-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study systematically evaluates multimodal large language models (MLLMs) on three core music perception tasks (syncopation scoring, transposition detection, and chord quality identification), focusing on the performance gap between audio and MIDI inputs. Method: It adapts LogicLM, a framework that pairs LLMs with symbolic solvers for structured reasoning, to music, so that perception and reasoning can be assessed separately. Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni are benchmarked under zero-shot and few-shot settings using Standalone, chain-of-thought (CoT), and LogicLM reasoning strategies. Contribution/Results: Models perform near ceiling on MIDI but lose accuracy on raw audio; reasoning strategies and few-shot prompting yield only minimal gains, which points to a perceptual bottleneck rather than a reasoning one. Gemini Pro achieves the highest performance across most conditions. The method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.

📝 Abstract

Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet "listen" reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.
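To make the "reasoning over symbols" side concrete, here is a minimal sketch of the kind of deterministic check a symbolic solver could run once notes are available as MIDI pitch numbers. It is not the paper's solver: the function names `chord_quality` and `transposition_interval`, the four-triad vocabulary, and the note-by-note definition of transposition are all assumptions made for illustration.

```python
# Hypothetical symbolic checks over MIDI note numbers (illustrative only,
# not the solver used in the paper).

# Interval patterns (semitones above the root) for four common triad qualities.
TRIAD_QUALITIES = {
    (0, 4, 7): "major",
    (0, 3, 7): "minor",
    (0, 3, 6): "diminished",
    (0, 4, 8): "augmented",
}


def chord_quality(midi_notes):
    """Return the triad quality of a set of MIDI notes, or None if no match."""
    pitch_classes = sorted({n % 12 for n in midi_notes})
    # Try each pitch class as the root so that inversions are handled.
    for root in pitch_classes:
        intervals = tuple(sorted((pc - root) % 12 for pc in pitch_classes))
        if intervals in TRIAD_QUALITIES:
            return TRIAD_QUALITIES[intervals]
    return None


def transposition_interval(melody_a, melody_b):
    """Return the constant semitone shift from melody_a to melody_b, or None."""
    if len(melody_a) != len(melody_b) or not melody_a:
        return None
    shifts = {b - a for a, b in zip(melody_a, melody_b)}
    # A true transposition shifts every note by the same amount.
    return shifts.pop() if len(shifts) == 1 else None
```

With symbolic input these checks are exact, which is consistent with the near-ceiling MIDI results; on audio, the model must first recover the notes, and any perception error propagates into the solver.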

Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal LLMs on core music perception tasks: syncopation scoring, transposition detection, and chord quality identification

Assessing performance gaps between audio and symbolic MIDI music inputs

Testing reasoning strategies and few-shot learning for music understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapted LogicLM, a framework combining LLMs with symbolic solvers, to music

Benchmarked models using audio versus MIDI input variations

Evaluated reasoning strategies including standalone, CoT and LogicLM
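The factors named above (model, task, input modality, shot setting, reasoning strategy) can be laid out as an evaluation grid. The sketch below assumes a fully crossed design, which the summary does not confirm, and the helper names `condition_grid` and `perceptual_gap` are hypothetical. It shows one way to compute the MIDI-minus-audio accuracy gap per condition once accuracies have been measured.

```python
from itertools import product

# Factor levels taken from the abstract; the full crossing is an assumption.
MODELS = ["Gemini 2.5 Pro", "Gemini 2.5 Flash", "Qwen2.5-Omni"]
TASKS = [
    "Syncopation Scoring",
    "Transposition Detection",
    "Chord Quality Identification",
]
MODALITIES = ["audio", "MIDI"]
SHOTS = ["zero-shot", "few-shot"]
STRATEGIES = ["Standalone", "CoT", "LogicLM"]


def condition_grid():
    """Enumerate every (model, task, modality, shots, strategy) combination."""
    return list(product(MODELS, TASKS, MODALITIES, SHOTS, STRATEGIES))


def perceptual_gap(accuracy):
    """Map (model, task, shots, strategy) to MIDI accuracy minus audio accuracy.

    `accuracy` maps condition tuples from condition_grid() to measured
    accuracies; conditions missing either modality are skipped.
    """
    gaps = {}
    for (model, task, modality, shots, strategy), acc in accuracy.items():
        if modality != "MIDI":
            continue
        audio_key = (model, task, "audio", shots, strategy)
        if audio_key in accuracy:
            gaps[(model, task, shots, strategy)] = acc - accuracy[audio_key]
    return gaps
```

A positive gap means the model does better on symbolic input than on audio for the same task and prompting setup, which is the pattern the paper reports.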

🔎 Similar Papers

Unifying Multitrack Music Arrangement via Reconstruction Fine-Tuning and Efficient Tokenization