Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech-to-speech (S2S) models produce intelligible but insufficiently expressive speech, primarily due to the absence of objective evaluation metrics aligned with human perception. Method: We propose DeEAR, the first framework integrating phonetics and psychology to construct a multi-granularity expressiveness assessment system across three dimensions—emotion, prosody, and spontaneity. Leveraging only <500 human preference annotations, DeEAR optimizes a Spearman rank correlation coefficient (SRCC)-based objective function, jointly modeling low-level acoustic features and high-level semantic information for efficient human–machine alignment. Results: On the ExpressiveSpeech dataset (14k highly expressive utterances), DeEAR achieves SRCC = 0.86. When used to guide S2S model training, it elevates expressiveness scores from 2.0 to 23.4 (out of 100), enabling fair benchmarking and high-quality data curation.
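The summary above reports alignment with human perception as a Spearman rank correlation coefficient (SRCC = 0.86) between model scores and human preference annotations. As an illustrative sketch only (not the DeEAR implementation), the metric can be computed in pure Python by ranking both score lists (average ranks for ties) and taking the Pearson correlation of the ranks:

```python
def average_ranks(values):
    """Assign 1-based ranks, averaging ranks over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def srcc(pred_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rp = average_ranks(pred_scores)
    rh = average_ranks(human_scores)
    n = len(rp)
    mp, mh = sum(rp) / n, sum(rh) / n
    cov = sum((a - mp) * (b - mh) for a, b in zip(rp, rh))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    sh = sum((b - mh) ** 2 for b in rh) ** 0.5
    return cov / (sp * sh)
```

A perfectly monotone scorer yields SRCC = 1.0, a reversed ranking yields -1.0; the paper's 0.86 indicates strong but imperfect rank agreement with the human annotators.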

📝 Abstract
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressiveness score of S2S models from 2.0 to 23.4 on a 100-point scale. Demos and code are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
Problem

Research questions and friction points this paper is trying to address.

S2S models lack an objective measure of speech expressiveness
Converting human preference into a reliable expressiveness score
Enabling fair benchmarking and targeted data curation for S2S models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework converts human preference into an objective expressiveness score
Evaluates speech across emotion, prosody, and spontaneity dimensions
Achieves human-aligned evaluation with fewer than 500 annotated samples
Zhiyu Lin
Beijing Jiaotong University
Jingwen Yang
The Chinese University of Hong Kong, Shenzhen, China
Jiale Zhao
Li Auto Inc., China
Meng Liu
Li Auto Inc., China
Sunzhu Li
Li Auto Inc., China
Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
large language models · natural language processing · information retrieval · applied machine learning