NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

📅 2026-06-08
🤖 AI Summary
This study addresses the challenge of synthesizing natural-sounding speech for Nüshu, an endangered script for which sentence-level spoken data are scarce: existing recordings consist mostly of isolated syllables, which are insufficient for training conventional text-to-speech (TTS) systems. To bridge this gap, the authors introduce NüshuVoice, the first TTS benchmark for Nüshu, comprising a sentence-level dataset that aligns Nüshu Unicode text, phonetic transcriptions, Chinese translations, and archival audio recordings. They further propose Nüshu-PitchVITS, a model that incorporates Nüshu's five-level tone marks as an explicit prosodic inductive bias within an F0-conditioned VITS framework. Using phoneme transcription and transfer learning, the approach synthesizes speech under extremely low-resource conditions. Experiments show that it outperforms strong baselines in spectral fidelity, fundamental frequency reconstruction, and human-rated intelligibility. The dataset and code are publicly released.
📝 Abstract
Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce NüshuVoice, the first TTS benchmark for Nüshu. We construct a sentence-level Nüshu text-to-audio dataset that aligns standardized Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose Nüshu-PitchVITS, an F0-conditioned VITS framework that leverages Nüshu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.
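To make the idea of using five-level pitch notation as a prosodic prior more concrete, the sketch below turns a tone written as level digits (for example "44" or "35") into a per-frame F0 target contour. It is an illustration only, not the paper's implementation: the digit-string tone format, the function names `level_to_hz` and `tone_to_contour`, the 100–300 Hz speaker range, and the log-scale interpolation are all assumptions introduced here.

```python
import math
from typing import List


def level_to_hz(level: int, f0_min: float = 100.0, f0_max: float = 300.0) -> float:
    """Map a tone level (1 = lowest, 5 = highest) to Hz on a log scale.

    The 100-300 Hz range is an assumed speaker range, not a value from the paper.
    """
    if not 1 <= level <= 5:
        raise ValueError(f"tone level must be in 1..5, got {level}")
    frac = (level - 1) / 4  # 0.0 for level 1, 1.0 for level 5
    return f0_min * (f0_max / f0_min) ** frac


def tone_to_contour(tone: str, n_frames: int,
                    f0_min: float = 100.0, f0_max: float = 300.0) -> List[float]:
    """Interpolate a tone digit string such as "44" or "35" into n_frames F0 values."""
    targets = [level_to_hz(int(ch), f0_min, f0_max) for ch in tone]
    if not targets:
        raise ValueError("tone string must contain at least one level digit")
    if len(targets) == 1:
        return [targets[0]] * n_frames

    contour = []
    for i in range(n_frames):
        # Position of this frame along the sequence of pitch targets.
        pos = i / (n_frames - 1) * (len(targets) - 1) if n_frames > 1 else 0.0
        lo = min(int(pos), len(targets) - 2)
        w = pos - lo
        # Interpolate in the log-frequency domain, which is closer to perceived pitch.
        log_f0 = (1 - w) * math.log(targets[lo]) + w * math.log(targets[lo + 1])
        contour.append(math.exp(log_f0))
    return contour


if __name__ == "__main__":
    # A rising tone from mid (3) to high (5) over 5 frames.
    print([round(f, 1) for f in tone_to_contour("35", 5)])
```

In an F0-conditioned model, a contour like this could serve as a coarse pitch target or as an initialization for a learned pitch predictor; how Nüshu-PitchVITS actually injects the tone information is described in the paper itself, not here.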
Problem

Research questions and friction points this paper is trying to address.

Nüshu
endangered language
text-to-speech
low-resource speech synthesis
pronunciation reconstruction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nüshu
low-resource TTS
pitch-aware synthesis
F0-conditioned VITS
endangered language preservation
Hongkun Yang
Ocean University of China
Xinhui Yi
The Hong Kong Polytechnic University
Xiyan Zhao
The Hong Kong Polytechnic University
Yibo Meng
Cornell University
Lionel Z. Wang
The Hong Kong Polytechnic University
Lixu Wang
Northwestern University
Machine Learning, Data Privacy
Yaqi Zhang
Stanford University
Compiler for reconfigurable hardware, reconfigurable hardware accelerator for data parallel workload
Ruiqi Chen
Vrije Universiteit Brussel
FPGAs, Domain-specific Accelerator
Xuanyue Zhao
Nanyang Technological University
Lanxin Zhang
Nanyang Technological University
Yu Zeng
University of Science and Technology of China
VLMs, RL
Weijia Chu
The Hong Kong Polytechnic University
Yiming Ma
Harbin Institute of Technology
Chenyu Liu
Northwestern Polytechnical University
condition monitoring, machine learning, smart manufacturing
Jianghao Lin
Shanghai Jiao Tong University
Large Language Models, AI Agents, Recommender Systems
Xin Xu
The Hong Kong Polytechnic University
Digital Transformation, Human-AI Interaction, Business Analytics