🤖 AI Summary
This paper addresses dual biases in text-to-speech (TTS) systems: accent bias—overreliance on dominant pronunciation patterns—and linguistic bias—neglect of dialectal lexicon and cultural cues. We propose a dual-signal optimization framework that decouples accent fidelity modeling from dialectal text localization. Our approach integrates contextual language adaptation, retrieval-augmented accent prompting (RAAP), and instruction-guided generation to achieve fair, culturally grounded multi-accent speech synthesis. The method is architecture-agnostic, requiring no modification to core TTS models. Evaluations across 12 English accents demonstrate significant improvements in accent identification accuracy and generation fairness (+18.7% average fairness score), while preserving high naturalness (MOS ≥ 4.1). This work establishes a scalable, inclusive paradigm for equitable TTS synthesis.
📝 Abstract
Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. These biases are interdependent, as authentic accent generation requires both accent fidelity and localized text. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization: (i) contextual linguistic adaptation that localizes input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP) that supplies accent-consistent speech prompts. Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.