Learning disentangled representations for instrument-based music similarity

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In instrument-based music similarity retrieval, conventional approaches that rely on source separation are impractical (clean instrumental queries are rarely available at retrieval time) and introduce artifacts, while multi-network designs lack flexibility for instrument-focused queries. To address this, we propose a single-network Conditional Similarity Network (CSN) that directly processes mixed audio and learns disentangled instrument-specific subspaces within a unified embedding space. Trained with a triplet loss and an instrument masking mechanism, the CSN achieves end-to-end, separation-free disentanglement of instrument attributes. Experiments show that the proposed method outperforms multi-network separation-based baselines on an instrument for which those baselines had low accuracy; that each learned subspace captures the characteristics of its instrument; and that user studies confirm the retrieved pieces are acceptable to listeners, especially when the query focuses on timbre. To the best of our knowledge, this is the first work to enable instrument-focused music similarity retrieval from mixed audio alone, without source separation or auxiliary networks.

📝 Abstract
A flexible recommendation and retrieval system requires music similarity in terms of multiple partial elements of musical pieces to allow users to select the element they want to focus on. A method for music similarity learning using multiple networks with individual instrumental signals is effective but faces the problem that using each clean instrumental signal as a query is impractical for retrieval systems and using separated instrumental sounds reduces accuracy owing to artifacts. In this paper, we present instrumental-part-based music similarity learning with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we designed a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which are trained using the triplet loss with masks. Experimental results showed that (1) the proposed method can obtain more accurate feature representation than using individual networks using separated sounds as input in the evaluation of an instrument that had low accuracy, (2) each sub-embedding space can hold the characteristics of the corresponding instrument, and (3) the selection of similar musical pieces focusing on each instrumental sound by the proposed method can obtain human acceptance, especially when focusing on timbre.
Problem

Research questions and friction points this paper is trying to address.

Learning music similarity from mixed sounds, not separated instruments
Creating single embedding space with disentangled instrument dimensions
Improving accuracy and human acceptance in instrument-focused music retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single network processes mixed sounds input
Disentangled embedding dimensions per instrument
Conditional Similarity Networks with triplet loss
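
The core mechanism above can be illustrated with a minimal sketch of a masked triplet loss: each instrument is assigned a fixed binary mask that selects its sub-dimensions of the shared embedding, and only those dimensions contribute to the triplet distances. This is a hypothetical illustration (function names, the hard binary masks, and the 8-dimensional toy embeddings are assumptions), not the paper's implementation.

```python
import numpy as np

def masked_triplet_loss(anchor, positive, negative, mask, margin=0.2):
    """Triplet loss restricted to one instrument's sub-embedding.

    Dimensions where mask == 0 are ignored, so similarity for the
    conditioning instrument is learned only in its own subspace,
    leaving the other subspaces free for other instruments.
    """
    d_ap = np.sum(((anchor - positive) * mask) ** 2)  # anchor-positive distance
    d_an = np.sum(((anchor - negative) * mask) ** 2)  # anchor-negative distance
    return max(d_ap - d_an + margin, 0.0)             # standard hinge form

# Toy example: an 8-dim embedding split into two 4-dim instrument subspaces.
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 8))  # stand-ins for anchor, positive, negative
drums_mask = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)
loss = masked_triplet_loss(emb[0], emb[1], emb[2], drums_mask)
print(loss)
```

In training, the positive and negative would be mixed-audio excerpts that are similar or dissimilar with respect to the masked instrument, so a single network learns all instrument subspaces jointly.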
Yuka Hashizume
Nagoya University
Music Information Retrieval
Li Li
Information Technology Center, Nagoya University, Aichi, Japan
Atsushi Miyashita
Graduate School of Informatics, Nagoya University, Nagoya, Japan
T. Toda
Information Technology Center, Nagoya University, Nagoya, Japan