SonoEdit: Null-Space Constrained Knowledge Editing for Pronunciation Correction in LLM-Based TTS

📅 2026-01-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Neural text-to-speech (TTS) systems often make systematic pronunciation errors on low-resource proper nouns, such as non-English names and locations, because their training corpora are predominantly English. To address this, the work proposes SonoEdit, presented as the first method to apply null-space-constrained editing to TTS pronunciation correction. Using acoustic causal tracing, SonoEdit identifies the Transformer layers responsible for the pronunciation of a target word and computes a closed-form weight update that is orthogonal to the subspace governing general speech synthesis. The approach needs only a single parameter update, with no retraining or manual phonetic annotation, to correct the pronunciation of specific words while preserving other speech generation behavior. The update is constructed so that the change on retained utterances is zero to first order, offering an alternative to conventional fine-tuning and data-dependent strategies.
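
To make the orthogonality idea concrete, here is one common way a null-space-constrained update can be written for a single linear layer. This is an illustrative formulation, not necessarily the exact update used by SonoEdit. The symbols are assumptions introduced here: $W$ is the edited layer's weight, the columns of $K_0$ are layer inputs collected from the preserved speech corpus, $k_*$ is the layer input for the target word, and $v_*$ is the layer output that would produce the desired pronunciation.

```latex
\begin{aligned}
P &= I - K_0\,(K_0^\top K_0)^{+} K_0^\top, \qquad P K_0 = 0, \\
r &= v_* - W k_*, \\
\Delta W &= \frac{r\,(P k_*)^\top}{k_*^\top P\, k_*}, \qquad k_*^\top P\, k_* > 0, \\
\Delta W\, K_0 &= 0, \qquad (W + \Delta W)\,k_* = v_* .
\end{aligned}
```

Because $P$ is a symmetric, idempotent projector that annihilates $K_0$, the update leaves the layer's output on every preserved input unchanged while mapping the target input exactly onto $v_*$. The condition $k_*^\top P\,k_* > 0$ states that the target input must not lie entirely inside the preserved subspace.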

📝 Abstract

Neural text-to-speech (TTS) systems systematically mispronounce low-resource proper nouns, particularly non-English names, brands, and geographic locations, due to their underrepresentation in predominantly English training corpora. Existing solutions typically rely on expensive multilingual data collection, supervised finetuning, or manual phonetic annotation, which limits the deployment of TTS systems in linguistically diverse settings. We introduce SonoEdit, a model editing technique that surgically corrects pronunciation errors in pre-trained TTS models without retraining. Instead of costly finetuning or explicit phoneme injection, we propose a parsimonious alternative based on Null-Space Pronunciation Editing, which performs a single-shot parameter update to modify the pronunciation of specific words while provably preserving all other model behavior. We first adapt Acoustic Causal Tracing to identify the Transformer layers responsible for text-to-pronunciation mapping. We then apply Null-Space Constrained Editing to compute a closed-form weight update that corrects the target pronunciation while remaining mathematically orthogonal to the subspace governing general speech generation. This constrained update steers the model's acoustic output toward a desired pronunciation exemplar while guaranteeing zero first-order change on a preserved speech corpus.
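
The closed-form, null-space-constrained update described in the abstract can be illustrated with a small numerical sketch. The code below is a toy under stated assumptions, not the authors' implementation: it treats the edited layer as a single linear map, uses random vectors in place of real TTS activations, and the names `null_space_projector`, `null_space_edit`, `K0`, `k_star`, and `v_star` are hypothetical. It shows only the core property: the edit moves the target input to the desired output while leaving the outputs on preserved inputs unchanged.

```python
import numpy as np


def null_space_projector(K0, rtol=1e-10):
    """Projector onto the orthogonal complement of the column space of K0.

    K0 has shape (d_in, n): one column per preserved layer input.
    """
    U, s, _ = np.linalg.svd(K0, full_matrices=False)
    rank = int(np.sum(s > rtol * s.max()))
    U_r = U[:, :rank]  # orthonormal basis of the preserved subspace
    return np.eye(K0.shape[0]) - U_r @ U_r.T


def null_space_edit(W, K0, k_star, v_star):
    """Minimum-norm rank-one update dW with dW @ K0 == 0 and (W + dW) @ k_star == v_star."""
    P = null_space_projector(K0)
    residual = v_star - W @ k_star   # what the layer currently gets wrong
    pk = P @ k_star                  # part of the target key outside the preserved subspace
    denom = k_star @ pk
    if denom < 1e-12:
        raise ValueError("target key lies inside the preserved subspace")
    return np.outer(residual, pk) / denom


# Toy data standing in for real activations (assumption: random values).
rng = np.random.default_rng(0)
d_in, d_out, n_preserved = 16, 8, 5
W = rng.standard_normal((d_out, d_in))
K0 = rng.standard_normal((d_in, n_preserved))
k_star = rng.standard_normal(d_in)
v_star = rng.standard_normal(d_out)

delta_W = null_space_edit(W, K0, k_star, v_star)
W_new = W + delta_W

preserved_ok = np.allclose(W_new @ K0, W @ K0)
target_ok = np.allclose(W_new @ k_star, v_star)
```

In this linear toy the preserved outputs are unchanged exactly. How the guarantee extends to a full nonlinear TTS model, where the abstract states it to first order, depends on details of the method that are not given here.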

Problem

Research questions and friction points this paper is trying to address.

pronunciation correction

low-resource proper nouns

text-to-speech

mispronunciation

linguistic diversity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Null-Space Constrained Editing

Pronunciation Correction

Text-to-Speech