🤖 AI Summary
This study systematically evaluates multimodal large language models (MLLMs) on three core music skills (syncopation scoring, transposition detection, and chord quality identification), focusing on the performance gap between audio and MIDI inputs. Method: It adapts LogicLM, a framework that combines LLMs with symbolic solvers for structured reasoning, to music, which makes the boundary between perception and reasoning explicit. Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni are benchmarked under zero-shot and few-shot settings using Standalone, chain-of-thought (CoT), and LogicLM reasoning strategies. Contribution/Results: Models perform near ceiling on MIDI but lose accuracy on raw audio; reasoning strategies and few-shot prompting yield only minimal gains, and LogicLM remains brittle on audio despite near-perfect MIDI accuracy, which points to a perceptual rather than a reasoning bottleneck. Gemini 2.5 Pro achieves the highest performance across most conditions. The method and dataset offer actionable guidance for building robust, audio-first music systems.
📝 Abstract
Multimodal Large Language Models (MLLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three state-of-the-art MLLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the last of these, we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but lose accuracy on audio. Reasoning strategies and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini 2.5 Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet "listen" reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.
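To make the experimental design concrete, the sketch below enumerates the factors named in the abstract and shows what a purely symbolic check for two of the tasks could look like once pitches are available as MIDI note numbers. It is an illustration under stated assumptions, not the authors' implementation: the function names, the four triad templates, and the exact-interval definition of transposition are hypothetical, and the abstract does not say whether every combination of factors was actually run.

```python
from itertools import product

# Factors taken from the abstract; a full cross of them is an assumption.
MODELS = ["Gemini 2.5 Pro", "Gemini 2.5 Flash", "Qwen2.5-Omni"]
TASKS = ["Syncopation Scoring", "Transposition Detection",
         "Chord Quality Identification"]
MODALITIES = ["audio", "MIDI"]
SHOTS = ["zero-shot", "few-shot"]
STRATEGIES = ["Standalone", "CoT", "LogicLM"]


def condition_grid():
    """Return every (model, task, modality, shots, strategy) combination."""
    return list(product(MODELS, TASKS, MODALITIES, SHOTS, STRATEGIES))


# Hypothetical triad templates: semitone intervals above the root.
CHORD_TEMPLATES = {
    "major": {0, 4, 7},
    "minor": {0, 3, 7},
    "diminished": {0, 3, 6},
    "augmented": {0, 4, 8},
}


def chord_quality(midi_pitches):
    """Name the triad quality of a set of MIDI pitches, or None if no match.

    Pitches are reduced to pitch classes, so inversions and octave
    doublings map to the same quality.
    """
    pitch_classes = {p % 12 for p in midi_pitches}
    for root in sorted(pitch_classes):
        intervals = {(pc - root) % 12 for pc in pitch_classes}
        for quality, template in CHORD_TEMPLATES.items():
            if intervals == template:
                return quality
    return None


def transposition_interval(melody_a, melody_b):
    """Return the semitone shift if melody_b is an exact transposition of
    melody_a, otherwise None."""
    if not melody_a or len(melody_a) != len(melody_b):
        return None
    shift = melody_b[0] - melody_a[0]
    if all(b - a == shift for a, b in zip(melody_a, melody_b)):
        return shift
    return None


if __name__ == "__main__":
    print(len(condition_grid()), "conditions in the full cross")
    print(chord_quality([60, 64, 67]))          # C-E-G
    print(transposition_interval([60, 62, 64], [63, 65, 67]))
```

Checks of this kind are deterministic once the notes are known, which is consistent with the abstract's observation that performance saturates on MIDI: the hard part for audio inputs is recovering the notes in the first place, not reasoning over them.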