Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech-to-speech (S2S) models produce intelligible but insufficiently expressive speech, primarily due to the absence of objective evaluation metrics aligned with human perception. Method: We propose DeEAR, the first framework integrating phonetics and psychology to construct a multi-granularity expressiveness assessment system across three dimensions—emotion, prosody, and spontaneity. Leveraging only <500 human preference annotations, DeEAR optimizes a Spearman rank correlation coefficient (SRCC)-based objective function, jointly modeling low-level acoustic features and high-level semantic information for efficient human–machine alignment. Results: On the ExpressiveSpeech dataset (14k highly expressive utterances), DeEAR achieves SRCC = 0.86. When used to guide S2S model training, it elevates expressiveness scores from 2.0 to 23.4 (out of 100), enabling fair benchmarking and high-quality data curation.
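The summary above reports alignment with human perception as a Spearman rank correlation coefficient (SRCC = 0.86) between model scores and human preference annotations. As an illustrative sketch only (not the DeEAR implementation), the metric can be computed in pure Python by ranking both score lists (average ranks for ties) and taking the Pearson correlation of the ranks:

```python
def average_ranks(values):
    """Assign 1-based ranks, averaging ranks over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def srcc(pred_scores, human_scores):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rp = average_ranks(pred_scores)
    rh = average_ranks(human_scores)
    n = len(rp)
    mp, mh = sum(rp) / n, sum(rh) / n
    cov = sum((a - mp) * (b - mh) for a, b in zip(rp, rh))
    sp = sum((a - mp) ** 2 for a in rp) ** 0.5
    sh = sum((b - mh) ** 2 for b in rh) ** 0.5
    return cov / (sp * sh)
```

A perfectly monotone scorer yields SRCC = 1.0, a reversed ranking yields -1.0; the paper's 0.86 indicates strong but imperfect rank agreement with the human annotators.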

📝 Abstract
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressiveness score of S2S models from 2.0 to 23.4 on a 100-point scale. Demos and code are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
Problem

Research questions and friction points this paper is trying to address.

S2S models lack an objective measure of speech expressiveness
Converting human preference into a reliable expressiveness score
Enabling fair benchmarking and targeted data curation for S2S models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework converts human preference into an objective expressiveness score
Evaluates speech across emotion, prosody, and spontaneity dimensions
Achieves human-aligned evaluation with fewer than 500 annotated samples
Zhiyu Lin
Beijing Jiaotong University
Jingwen Yang
The Chinese University of Hong Kong, Shenzhen, China
Jiale Zhao
Li Auto Inc., China
Meng Liu
Li Auto Inc., China
Sunzhu Li
Li Auto Inc., China
Benyou Wang
Assistant Professor, The Chinese University of Hong Kong, Shenzhen
large language models · natural language processing · information retrieval · applied machine learning