🤖 AI Summary
This paper introduces the voice timbre attribute detection (vTAD) task, which aims to make timbre comparable and interpretable by modeling it quantitatively with human-understandable perceptual attributes (e.g., “bright”, “hoarse”). Methodologically, it frames differences in timbre perception as a contrastive attribute discrimination problem within speaker embedding space (the first such formulation) and constructs VCTK-RVA, the first dedicated vTAD benchmark dataset. It evaluates two speaker encoders, ECAPA-TDNN and FACodec, and finds that ECAPA-TDNN performs better on seen speakers, whereas FACodec generalizes better to unseen speakers. Key contributions include: (1) a formal definition of the vTAD task; (2) public release of the VCTK-RVA dataset and associated open-source code; and (3) an empirical comparison of how the two speaker encoders generalize, which provides a basis for interpretable timbre analysis.
📝 Abstract
This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is described by a set of sensory attributes that reflect how humans perceive it. A pair of speech utterances is processed, and the two are compared in terms of their intensity in a designated timbre descriptor. A framework built on the speaker embeddings extracted from the speech utterances is also proposed. The investigation is conducted on the VCTK-RVA dataset. Experiments with the ECAPA-TDNN and FACodec speaker encoders show that: 1) the ECAPA-TDNN speaker encoder performs better in the seen scenario, where the testing speakers are included in the training set; 2) the FACodec speaker encoder performs better in the unseen scenario, where the testing speakers are not part of the training set, indicating stronger generalization. The VCTK-RVA dataset and open-source code are available at https://github.com/vTAD2025-Challenge/vTAD.
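The pairwise comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' framework: the linear scorer, the hand-picked weights, the 4-dimensional toy embeddings, and the names `attribute_scores`, `detect`, `WEIGHTS`, and `BIASES` are all assumptions made for illustration. In practice the embeddings would come from a speaker encoder such as ECAPA-TDNN or FACodec, and the scorer would be trained on annotated utterance pairs.

```python
import math

# Hypothetical, hand-picked parameters (NOT trained, NOT from the paper).
# Each attribute has one weight per dimension of the concatenated pair [A ; B].
WEIGHTS = {
    "bright": [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0],
    "hoarse": [-1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
}
BIASES = {"bright": 0.0, "hoarse": 0.0}


def attribute_scores(emb_a, emb_b, weights, biases):
    """For each attribute, estimate the probability that utterance B is
    stronger than utterance A in that attribute."""
    pair = list(emb_a) + list(emb_b)  # concatenate the two speaker embeddings
    scores = {}
    for name, w in weights.items():
        logit = sum(wi * xi for wi, xi in zip(w, pair)) + biases[name]
        scores[name] = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return scores


def detect(emb_a, emb_b, attribute, weights, biases, threshold=0.5):
    """Return True if B is judged stronger than A in the given attribute."""
    return attribute_scores(emb_a, emb_b, weights, biases)[attribute] > threshold


# Toy embeddings standing in for speaker-encoder outputs (assumption).
emb_a = [0.1, 0.2, 0.0, 0.1]
emb_b = [0.5, 0.4, 0.3, 0.2]

print(attribute_scores(emb_a, emb_b, WEIGHTS, BIASES))
print(detect(emb_a, emb_b, "bright", WEIGHTS, BIASES))
```

With zero biases and these mirrored weights, swapping the two utterances flips the decision, which is the behavior one would expect from a comparison of attribute intensity between a pair.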