🤖 AI Summary
Existing speech-to-speech (S2S) models produce intelligible but insufficiently expressive speech, primarily due to the absence of objective evaluation metrics aligned with human perception.
Method: We propose DeEAR, the first framework integrating phonetics and psychology to construct a multi-granularity expressiveness assessment system across three dimensions: emotion, prosody, and spontaneity. Using fewer than 500 human preference annotations, DeEAR optimizes an objective function based on the Spearman rank correlation coefficient (SRCC), jointly modeling low-level acoustic features and high-level semantic information for efficient human–machine alignment.
Results: DeEAR achieves SRCC = 0.86 against human perception. It is also used to select the 14K expressive utterances that form the ExpressiveSpeech dataset, which raises the expressiveness score of S2S models from 2.0 to 23.4 (out of 100). Beyond scoring, DeEAR enables fair benchmarking and targeted data curation.
📝 Abstract
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score of S2S models from 2.0 to 23.4 on a 100-point scale. Demos and code are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
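For readers unfamiliar with the alignment metric reported above, the following is a minimal sketch of how a Spearman rank correlation between human ratings and automatic scores can be computed. It is not DeEAR's code: the function names (`average_ranks`, `pearson`, `srcc`) and the example ratings are assumptions made purely for illustration, and ties are handled with average ranks, which is one common convention.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def srcc(human_scores, model_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(human_scores) != len(model_scores) or len(human_scores) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(human_scores), average_ranks(model_scores))


if __name__ == "__main__":
    # Hypothetical ratings for five utterances (illustrative values only).
    human = [1, 2, 3, 4, 5]
    model = [2, 1, 4, 3, 5]
    print(round(srcc(human, model), 3))
```

Because only the ordering of scores matters, a metric can reach a high SRCC even if its raw scale differs from the human rating scale, which is why rank correlation is a natural fit for comparing an automatic score against human preference.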