🤖 AI Summary
This study addresses the challenge of synthesizing natural-sounding speech for Nüshu, an endangered script for which sentence-level spoken data are scarce: existing recordings consist mostly of isolated syllables and are insufficient for conventional text-to-speech (TTS) systems. To bridge this gap, we introduce NüshuVoice, the first TTS benchmark for Nüshu, built around a sentence-level dataset that aligns Nüshu Unicode text, phonetic transcriptions, Chinese translations, and archival audio recordings. We further propose Nüshu-PitchVITS, a model that explicitly incorporates Nüshu's five-level pitch notation as a prosodic inductive bias within an F0-conditioned VITS framework. Leveraging phoneme transcription and transfer learning, our approach synthesizes speech under extremely low-resource conditions. Experimental results show that it outperforms strong baselines in spectral fidelity, fundamental frequency (F0) reconstruction, and human-rated intelligibility. The dataset and code are publicly released.
📝 Abstract
Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce NüshuVoice, the first TTS benchmark for Nüshu. We construct a sentence-level Nüshu text-to-audio dataset that aligns standardized Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose Nüshu-PitchVITS, an F0-conditioned VITS framework that leverages Nüshu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.
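To make the idea of a five-level pitch notation acting as a prosodic prior more concrete, the sketch below expands a tone string such as "51" (high falling) into a per-frame F0 target contour. This is only an illustration under stated assumptions, not the conditioning mechanism of Nüshu-PitchVITS, which the abstract does not detail. The function name `tone_to_f0`, the 100-300 Hz speaker range, and the log-linear interpolation are all hypothetical choices made for this example.

```python
import math


def tone_to_f0(tone: str, n_frames: int,
               f0_min: float = 100.0, f0_max: float = 300.0) -> list[float]:
    """Expand a five-level tone string (e.g. "51") into a per-frame F0 contour in Hz.

    Assumptions (illustrative, not from the paper): level 1 maps to f0_min,
    level 5 maps to f0_max, and intermediate values are interpolated linearly
    in the log-frequency domain.
    """
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    levels = [int(ch) for ch in tone]  # raises ValueError on non-digit characters
    if not levels or any(lv < 1 or lv > 5 for lv in levels):
        raise ValueError("tone must be a non-empty string of digits 1-5")

    # Map each level 1..5 to an evenly spaced point on the log-F0 axis.
    lo, hi = math.log(f0_min), math.log(f0_max)
    targets = [lo + (lv - 1) / 4 * (hi - lo) for lv in levels]

    # A level tone (single digit) or a single frame yields a flat contour
    # at the first target.
    if len(targets) == 1 or n_frames == 1:
        return [math.exp(targets[0])] * n_frames

    # Piecewise-linear interpolation between consecutive pitch targets.
    contour = []
    for i in range(n_frames):
        pos = i / (n_frames - 1) * (len(targets) - 1)
        k = min(int(pos), len(targets) - 2)
        frac = pos - k
        contour.append(math.exp(targets[k] * (1 - frac) + targets[k + 1] * frac))
    return contour
```

For instance, `tone_to_f0("51", 3)` starts at the top of the assumed range, ends at the bottom, and passes through their geometric mean at the midpoint. In an F0-conditioned model, a contour of this kind could be supplied alongside the phoneme sequence as a coarse pitch target, which is one plausible way that tone notation might compensate for scarce sentence-level audio.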