🤖 AI Summary
Neural text-to-speech (TTS) systems often exhibit systematic pronunciation errors on low-resource proper nouns, such as non-English names and locations, because their training corpora are predominantly English. To address this, this work proposes SonoEdit, the first method to apply null-space-constrained editing to TTS pronunciation correction. Using acoustic causal tracing, SonoEdit identifies the Transformer layers responsible for the target word's pronunciation and computes a closed-form weight update that is orthogonal to the subspace governing general speech synthesis. The approach requires only a single parameter update, with no retraining or manual phonetic annotation, to correct the pronunciation of specific words while preserving other speech generation behavior. The method achieves zero first-order perturbation on the preserved utterances, significantly outperforming conventional fine-tuning and data-dependent strategies.
📝 Abstract
Neural text-to-speech (TTS) systems systematically mispronounce low-resource proper nouns, particularly non-English names, brands, and geographic locations, due to their underrepresentation in predominantly English training corpora. Existing solutions typically rely on expensive multilingual data collection, supervised finetuning, or manual phonetic annotation, which limits the deployment of TTS systems in linguistically diverse settings. We introduce SonoEdit, a model editing technique that surgically corrects pronunciation errors in pre-trained TTS models without retraining. Instead of costly finetuning or explicit phoneme injection, we propose a parsimonious alternative based on Null-Space Pronunciation Editing, which performs a single-shot parameter update to modify the pronunciation of specific words while provably preserving all other model behavior. We first adapt Acoustic Causal Tracing to identify the Transformer layers responsible for text-to-pronunciation mapping. We then apply Null-Space Constrained Editing to compute a closed-form weight update that corrects the target pronunciation while remaining mathematically orthogonal to the subspace governing general speech generation. This constrained update steers the model's acoustic output toward a desired pronunciation exemplar while guaranteeing zero first-order change on a preserved speech corpus.
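The null-space constraint described above can be illustrated on a single linear layer. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the update formula, so the rank-one form below is a standard construction chosen for illustration, and the names `null_space_projector`, `null_space_edit`, `K0`, `k_target`, and `v_target` are hypothetical. Here `K0` stands for the layer inputs collected from the preserved speech corpus, `k_target` for the layer input of the mispronounced word, and `v_target` for the layer output that would yield the desired pronunciation.

```python
import numpy as np


def null_space_projector(K0, rel_tol=1e-10):
    """Projector onto the orthogonal complement of span(K0).

    K0 has shape (d_in, n): one column per preserved input (assumed layout).
    """
    U, s, _ = np.linalg.svd(K0, full_matrices=False)
    rank = int(np.sum(s > rel_tol * s.max()))
    U = U[:, :rank]  # orthonormal basis of the preserved subspace
    return np.eye(K0.shape[0]) - U @ U.T


def null_space_edit(W, K0, k_target, v_target):
    """Closed-form rank-one update dW such that
    (W + dW) @ k_target == v_target and dW @ K0 == 0 (illustrative form)."""
    P = null_space_projector(K0)
    k_proj = P @ k_target              # part of the target input outside span(K0)
    denom = float(k_target @ k_proj)   # equals ||P k_target||^2
    if denom < 1e-12:
        raise ValueError("target input lies inside the preserved subspace")
    residual = v_target - W @ k_target  # what the layer currently gets wrong
    return np.outer(residual, k_proj) / denom


# Toy data (random, purely illustrative).
rng = np.random.default_rng(0)
d_in, d_out, n_keep = 16, 8, 10
W = rng.standard_normal((d_out, d_in))
K0 = rng.standard_normal((d_in, n_keep))
k_target = rng.standard_normal(d_in)
v_target = rng.standard_normal(d_out)

delta_W = null_space_edit(W, K0, k_target, v_target)
W_edited = W + delta_W

print("max error on target:", np.abs(W_edited @ k_target - v_target).max())
print("max change on preserved inputs:", np.abs(delta_W @ K0).max())
```

For one linear layer the preserved outputs are unchanged up to floating-point error. In a full Transformer the layer sits inside a nonlinear network, which is presumably why the guarantee is stated as zero first-order change rather than exact invariance. The sketch also requires the preserved inputs to leave a nontrivial null space (fewer independent columns in `K0` than input dimensions).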