🤖 AI Summary
Traditional phonemization methods exhibit low accuracy and poor consistency on proper names, loanwords, abbreviations, and homographs. To address these challenges, OLaPh introduces an integrated text-to-phoneme conversion framework: it constructs a large-scale, multi-source pronunciation dictionary; incorporates NLP-based preprocessing, compound-word segmentation, and rule-based engines; and employs a probabilistic scoring function to fuse the candidates these strategies produce. Furthermore, OLaPh fine-tunes a large language model on synthetically generated data to improve generalization to out-of-vocabulary words and low-frequency variants. On German and English benchmark datasets, OLaPh significantly outperforms state-of-the-art approaches, achieving substantial gains in phonemization accuracy, particularly on challenging lexical items. The project fully open-sources all models, dictionaries, and code, providing a reproducible, extensible infrastructure for speech synthesis frontend research.
📝 Abstract
Phonemization, the conversion of text into phonemes, is a key step in text-to-speech. Traditional approaches use rule-based transformations and lexicon lookups, while more advanced methods apply preprocessing techniques or neural networks for improved accuracy on out-of-domain vocabulary. However, all systems struggle with names, loanwords, abbreviations, and homographs. This work presents OLaPh (Optimal Language Phonemizer), a framework that combines large lexica, multiple NLP techniques, and compound resolution with a probabilistic scoring function. Evaluations in German and English show improved accuracy over previous approaches, including on a challenging dataset. To further address unresolved cases, we train a large language model on OLaPh-generated data, which achieves even stronger generalization and performance. Together, the framework and LLM improve phonemization consistency and provide a freely available resource for future research.
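To make the multi-strategy idea concrete, here is a minimal sketch of how lexicon lookup, compound resolution, and a rule-based fallback might be fused with a probabilistic score. All names, weights, lexicon entries, and the scoring scheme are illustrative assumptions, not taken from the actual OLaPh implementation.

```python
# Hypothetical sketch of multi-strategy phonemization with probabilistic
# fusion. Lexicon entries, weights, and rules are illustrative only.

# Tiny stand-in lexicon: word -> (IPA phonemes, prior confidence).
LEXICON = {
    "speech": ("s p iː tʃ", 0.99),
    "synthesis": ("s ɪ n θ ə s ɪ s", 0.98),
}

# Illustrative trust weights per strategy (lexicon hits trusted most).
STRATEGY_WEIGHTS = {"lexicon": 1.0, "compound": 0.8, "rules": 0.5}

def lexicon_lookup(word):
    """Direct dictionary lookup; returns (phonemes, confidence) or None."""
    return LEXICON.get(word.lower())

def compound_split(word):
    """Try to resolve a compound as two in-lexicon parts."""
    for i in range(2, len(word) - 1):
        a, b = lexicon_lookup(word[:i]), lexicon_lookup(word[i:])
        if a and b:
            # Combined confidence is the product of the parts' confidences.
            return a[0] + " " + b[0], a[1] * b[1]
    return None

def rule_based(word):
    """Crude letter-to-sound fallback that always yields a candidate."""
    naive = {"c": "k", "y": "i"}
    return " ".join(naive.get(ch, ch) for ch in word.lower()), 0.3

def phonemize(word):
    """Collect candidates from all strategies, keep the best-scoring one."""
    candidates = []
    for name, strategy in [("lexicon", lexicon_lookup),
                           ("compound", compound_split),
                           ("rules", rule_based)]:
        result = strategy(word)
        if result:
            phonemes, conf = result
            candidates.append((STRATEGY_WEIGHTS[name] * conf, phonemes, name))
    _, phonemes, source = max(candidates)
    return phonemes, source

print(phonemize("speech"))           # resolved by direct lexicon hit
print(phonemize("speechsynthesis"))  # resolved via compound split
print(phonemize("xyz"))              # falls back to the rule engine
```

The key design point this sketch tries to convey: every strategy proposes candidates with an attached confidence, and a single scoring function, rather than a fixed priority cascade, decides which pronunciation wins, so a confident compound split can outrank a weak rule-based guess.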