CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses two biases in text-to-speech (TTS) systems: accent bias (overreliance on dominant pronunciation patterns) and linguistic bias (neglect of dialectal lexicon and cultural cues). It proposes a dual-signal optimization framework that decouples accent fidelity modeling from dialectal text localization. The approach combines contextual linguistic adaptation, retrieval-augmented accent prompting (RAAP), and instruction-guided generation to produce fair, culturally grounded multi-accent speech. The method is backbone-agnostic and requires no modification to the core TTS model. Evaluations across 12 English accents show improved accent identification accuracy and generation fairness (+18.7% average fairness score) while preserving high naturalness (MOS ≥ 4.1).

📝 Abstract
Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. These biases are interdependent, as authentic accent generation requires both accent fidelity and localized text. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization: (i) contextual linguistic adaptation that localizes input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP) that supplies accent-consistent speech prompts. Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.
Problem

Research questions and friction points this paper is trying to address.

Mitigates accent bias in text-to-speech generation systems
Addresses linguistic bias by localizing dialect-specific lexical cues
Treats the two biases as interdependent: authentic accent generation requires both accent fidelity and dialect-localized text
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextual linguistic adaptation localizes input text to target dialect
Retrieval-augmented accent prompting supplies accent-consistent speech prompts
Dual-signal optimization addresses accent and linguistic biases simultaneously
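
The two signals listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon entries, the `Prompt` record, the embedding vectors, and the request fields are all hypothetical, and a simple word-swap table stands in for contextual linguistic adaptation. It only shows the shape of the idea: localize the text for the target dialect, retrieve an accent-matched speech prompt by similarity, and pass both to an unmodified TTS backbone together with an instruction.

```python
import math
from dataclasses import dataclass

# Hypothetical dialect lexicon; entries are illustrative, not from the paper.
LEXICON = {
    "british": {"apartment": "flat", "elevator": "lift"},
    "american": {},
}


@dataclass
class Prompt:
    """A candidate speech prompt (assumed structure)."""
    clip_id: str
    accent: str
    embedding: list  # assumed precomputed speaker/accent embedding


def localize(text: str, dialect: str) -> str:
    """Swap whole words using the dialect lexicon (stand-in for adaptation)."""
    table = LEXICON.get(dialect, {})
    return " ".join(table.get(word, word) for word in text.split())


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_prompt(query, accent: str, pool) -> Prompt:
    """Return the most similar prompt among those labeled with the target accent."""
    candidates = [p for p in pool if p.accent == accent]
    if not candidates:
        raise LookupError(f"no prompts available for accent {accent!r}")
    return max(candidates, key=lambda p: cosine(query, p.embedding))


def build_request(text: str, accent: str, query, pool) -> dict:
    """Bundle both signals plus an instruction for a TTS backbone."""
    return {
        "text": localize(text, accent),
        "prompt_clip": retrieve_prompt(query, accent, pool).clip_id,
        "instruction": f"Speak with a {accent} accent.",
    }


pool = [
    Prompt("clip_a", "british", [1.0, 0.0]),
    Prompt("clip_b", "british", [0.0, 1.0]),
    Prompt("clip_c", "american", [1.0, 0.0]),
]
request = build_request(
    "take the elevator to my apartment", "british", [0.9, 0.1], pool
)
print(request)
```

Filtering the pool by accent label before ranking keeps the retrieved prompt accent-consistent even when a clip from another accent is closer in embedding space; whether the paper's RAAP component works this way is an assumption of this sketch.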
Crystal Min Hui Poon
Infocomm Technology Cluster, Singapore Institute of Technology
Pai Chet Ng
Infocomm Technology Cluster, Singapore Institute of Technology
Xiaoxiao Miao
Duke Kunshan University
Speech Privacy, Speaker and Language Identification, Speech Synthesis
Immanuel Jun Kai Loh
Infocomm Technology Cluster, Singapore Institute of Technology
Bowen Zhang
Infocomm Technology Cluster, Singapore Institute of Technology
Haoyu Song
Infocomm Technology Cluster, Singapore Institute of Technology
Ian Mcloughlin
Infocomm Technology Cluster, Singapore Institute of Technology