Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

📅 2026-01-20
🤖 AI Summary
Current neural text-to-speech (TTS) systems rely on speaker embeddings to control accent, yet these embeddings conflate linguistic factors such as accent with non-linguistic attributes like timbre and emotion, which limits interpretability and controllability. This work integrates linguistically motivated phonological rules (flapping, rhoticity, and vowel correspondences) into neural TTS models, using American and British English as a case study, and conducts controlled experiments to analyze the interaction between speaker embeddings and rule-based transformations. It introduces the phoneme shift rate (PSR), a metric that quantifies the extent to which speaker embeddings preserve or override phonological rules, thereby revealing representational entanglement between accent and speaker identity. Experiments show that combining explicit phonological rules with speaker embeddings yields more authentic accents, while embeddings can attenuate or overwrite the rules' effects, indicating that accent and speaker characteristics are coupled and offering a new evaluation framework for interpretable accent control.
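To make the three rule families concrete, the following is a minimal sketch of how flapping, non-rhoticity, and one vowel correspondence could be written as rewrites over a phoneme sequence. It is not the authors' implementation: the ARPAbet-style symbols, the `OH` label for the British LOT vowel, the function names, and the simplified rule contexts (no stress marking, no lexical exceptions) are all assumptions made for illustration.

```python
# Illustrative sketch only: symbols, contexts, and function names are
# assumptions, not the rule set used in the paper.

# ARPAbet-style vowels, plus "OH" as a hypothetical label for British /ɒ/.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AX", "AY", "EH", "ER",
    "EY", "IH", "IY", "OH", "OW", "OY", "UH", "UW",
}


def flap(phones):
    """American-style flapping: T between two vowels becomes the flap DX.

    Simplification: real flapping also depends on stress, which is ignored.
    """
    out = list(phones)
    for i in range(1, len(phones) - 1):
        if phones[i] == "T" and phones[i - 1] in VOWELS and phones[i + 1] in VOWELS:
            out[i] = "DX"
    return out


def shift_lot_vowel(phones):
    """Vowel correspondence: American AA maps to British OH unless R follows.

    Simplification: words such as "father" keep AA and are not handled.
    """
    out = list(phones)
    for i, p in enumerate(phones):
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if p == "AA" and nxt != "R":
            out[i] = "OH"
    return out


def drop_coda_r(phones):
    """British-style non-rhoticity: delete R unless a vowel follows it."""
    out = []
    for i, p in enumerate(phones):
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if p == "R" and nxt not in VOWELS:
            continue
        out.append(p)
    return out


def to_british(phones):
    """Apply the vowel shift first (it needs the R context), then drop R."""
    return drop_coda_r(shift_lot_vowel(phones))


if __name__ == "__main__":
    print(flap(["W", "AO", "T", "ER"]))      # "water", American-style
    print(to_british(["K", "AA", "R"]))      # "car", British-style
    print(to_british(["HH", "AA", "T"]))     # "hot", British-style
```

The rewritten phoneme sequence would then be passed to the TTS model together with a speaker embedding, which is the setting in which the rule and the embedding can agree or conflict.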

📝 Abstract
Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
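The abstract does not state the formula for PSR, so the following is only one plausible reading, offered as a sketch rather than the authors' definition. It assumes three position-aligned phoneme sequences of equal length: the input before the rule (`source`), the input after the rule (`target`), and the phonemes recognized from the synthesized audio (`observed`). The function name, the alignment assumption, and the choice to report the fraction of rule sites that survive (rather than the fraction overridden) are all hypothetical.

```python
# Hypothetical reading of a "phoneme shift rate"; the paper's actual
# definition may differ (for example, it may report the complement).

def phoneme_shift_rate_sketch(source, target, observed):
    """Fraction of rule-affected positions where the rule's output survives.

    source   -- phonemes before the phonological rule was applied
    target   -- phonemes after the rule was applied
    observed -- phonemes recognized from the synthesized speech
    All three are assumed to be pre-aligned and of equal length.
    Returns None when the rule changed nothing, since the rate is undefined.
    """
    if not (len(source) == len(target) == len(observed)):
        raise ValueError("sequences must be aligned to the same length")

    # Positions where the rule actually changed the phoneme.
    rule_sites = [i for i, (s, t) in enumerate(zip(source, target)) if s != t]
    if not rule_sites:
        return None

    # Count the sites where the synthesized speech realizes the rule's output.
    preserved = sum(1 for i in rule_sites if observed[i] == target[i])
    return preserved / len(rule_sites)


if __name__ == "__main__":
    source = ["W", "AO", "T", "ER", "B", "AH", "T", "ER"]     # before flapping
    target = ["W", "AO", "DX", "ER", "B", "AH", "DX", "ER"]   # after flapping
    observed = ["W", "AO", "DX", "ER", "B", "AH", "T", "ER"]  # one flap undone
    print(phoneme_shift_rate_sketch(source, target, observed))
```

Under this reading, a value near 1 would mean the speaker embedding leaves the rule intact, and a value near 0 would mean the embedding overrides it. Rules that delete phonemes would need a real alignment step, which this sketch leaves out.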
Problem

Research questions and friction points this paper is trying to address.

accented speech synthesis
speaker embeddings
phonological rules
disentanglement
accent control
Innovation

Methods, ideas, or system contributions that make the work stand out.

phoneme shift rate
phonological rules
accented speech synthesis
speaker embedding disentanglement
rule-based TTS
Thanathai Lertpetchpun
Signal Analysis and Interpretation Lab, University of Southern California
Yoonjeong Lee
Signal Analysis and Interpretation Lab, University of Southern California
Thanapat Trachu
Thomas Lord Department of Computer Science, University of Southern California
Jihwan Lee
PhD Student, Signal Analysis and Interpretation Lab (SAIL) at University of Southern California
brain-computer interfaces, speech synthesis, biosignal-to-speech, articulatory phonetics
Tiantian Feng
Postdoc Researcher
Health and Behaviors, Wearable Computing, Affective Computing, Speech and Biosignal, Responsible ML
Dani Byrd
Department of Linguistics, University of Southern California
Shrikanth S. Narayanan
Signal Analysis and Interpretation Lab, University of Southern California; Thomas Lord Department of Computer Science, University of Southern California; Department of Linguistics, University of Southern California