🤖 AI Summary
This study addresses the problem of modeling and retrieving music based on instrumental stems (e.g., drums, bass, lead) by quantifying perceptual similarity at the stem level. Methodologically, it conducts a large-scale ABX auditory discrimination experiment with 586 participants, leveraging stem-separated audio from the Slakh2100 dataset and similarity annotations along four dimensions: timbre, rhythm, melody, and overall. The analysis systematically measures each stem's differential contribution to overall perceptual similarity. Key contributions include: (1) empirical evidence that rhythm and melody dominate similarity judgments across most stems (with drums the exception, their melody contributing negligibly); (2) identification of a bias in previously proposed music similarity features, which mainly capture timbral similarity while neglecting stem-specific perceptual mechanisms; and (3) an empirically grounded, interpretable weighting benchmark and a new evaluation perspective for stem-level music representation learning.
📝 Abstract
This paper investigates the perceptual similarity between music tracks, focusing on individual instrumental parts, through a large-scale listening test, toward developing instrumental-part-based music retrieval. In the listening test, 586 subjects evaluated the perceptual similarity of audio tracks through an ABX test, using the music tracks and their stems from the test set of the Slakh2100 dataset. The perceptual similarity was evaluated from four perspectives: timbre, rhythm, melody, and overall. Our analysis of the results shows that 1) perceptual music similarity varies depending on which instrumental part is focused on within each track; 2) rhythm and melody tend to have a larger impact on perceptual music similarity than timbre, except for the melody of drums; and 3) previously proposed music similarity features tend to capture mainly the perceptual similarity of timbre.
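The abstract describes an ABX protocol rated along four perceptual dimensions. As a minimal sketch of how such responses might be tallied into per-stem, per-aspect preference rates, here is an illustrative aggregation; the field names, data, and scoring scheme are assumptions for illustration, not the paper's actual analysis pipeline.

```python
# Hypothetical sketch: aggregate ABX responses into preference rates.
# Each trial records the stem under focus, the rated aspect
# (timbre / rhythm / melody / overall), and whether the listener judged
# candidate A or B more similar to the reference X.
from collections import defaultdict

def aggregate_abx(responses):
    """Return, for each (stem, aspect) pair, the fraction of trials
    in which listeners chose candidate A as closer to the reference."""
    counts = defaultdict(lambda: [0, 0])  # (stem, aspect) -> [A choices, total]
    for r in responses:
        key = (r["stem"], r["aspect"])
        counts[key][1] += 1
        if r["choice"] == "A":
            counts[key][0] += 1
    return {key: a / n for key, (a, n) in counts.items()}

# Illustrative (made-up) responses:
responses = [
    {"stem": "drums", "aspect": "rhythm", "choice": "A"},
    {"stem": "drums", "aspect": "rhythm", "choice": "A"},
    {"stem": "drums", "aspect": "rhythm", "choice": "B"},
    {"stem": "bass",  "aspect": "timbre", "choice": "B"},
]

rates = aggregate_abx(responses)
print(rates[("drums", "rhythm")])  # 2 of 3 trials chose A
```

Comparing such rates across aspects within each stem is one simple way to see which dimension (timbre, rhythm, or melody) drives the similarity judgment for that stem.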