Do speech foundation models perceive speaker similarity as humans do?

📅 2026-06-04

🤖 AI Summary
This study investigates whether distances between speaker embeddings produced by speech foundation models align with human subjective perception of speaker similarity. Using a human listening experiment, the authors evaluate how well embedding-based distances from more than 40 speech foundation models correlate with human similarity ratings, and analyze which aspects of model configuration influence this perceptual alignment. The work provides empirical evidence of where current models do and do not capture human judgments of speaker similarity, identifies the factors that most affect consistency with human perception, and offers guidance for designing speech foundation models that are more perceptually grounded.
📝 Abstract
This study presents a comparative analysis of the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners can judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.
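The comparison described in the abstract, model-derived distances against human similarity scores, can be illustrated with a minimal sketch. The abstract does not say which distance or correlation measure the authors use, so cosine distance and Spearman rank correlation are assumptions here, and the embeddings and ratings are hypothetical toy values rather than data from the paper.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical toy data: one (embedding_a, embedding_b) tuple per voice pair.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.0]),
]
# Hypothetical mean listener ratings for the same pairs (higher = more similar).
human_similarity = [4.8, 3.5, 2.0, 1.1]

distances = [cosine_distance(a, b) for a, b in pairs]
rho = spearman(distances, human_similarity)
print(f"Spearman rho = {rho:.2f}")
```

Because a larger distance should correspond to a lower similarity rating, a strongly negative rho indicates close agreement with listeners in this setup. A rank correlation is used in the sketch because it only requires that model distances and human ratings order the voice pairs consistently, not that they share a scale.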
Problem

Research questions and friction points this paper is trying to address.

speech foundation models
speaker similarity
human perception
speaker embeddings
perceptual alignment
🔎 Similar Papers