🤖 AI Summary
This work addresses the critical scarcity of speech data for three typologically similar yet severely low-resource North American Indigenous languages: Ojibwe, Mi'kmaq, and Maliseet. Method: We propose the first lightweight, attention-free multilingual flow-matching text-to-speech (TTS) system for these languages. Our approach introduces joint multilingual training, providing the first empirical validation of this strategy for Indigenous-language TTS, and leverages parameter sharing within a flow-matching architecture to improve memory efficiency and cross-lingual generalization. Contribution/Results: (1) The multilingual model consistently outperforms monolingual baselines in naturalness and intelligibility, meeting the requirements of language-revitalization applications; (2) We develop a community-centered human evaluation framework that identifies and mitigates cultural biases inherent in conventional automatic and subjective metrics. Together, these contributions establish a reproducible technical pipeline and an ethically grounded evaluation framework for TTS in low-resource endangered languages.
📝 Abstract
We present lightweight flow-matching multilingual text-to-speech (TTS) systems for Ojibwe, Mi'kmaq, and Maliseet, three Indigenous languages of North America. Our results show that training a multilingual TTS model on these three typologically similar languages improves performance over monolingual models, especially when data are scarce. Attention-free architectures are highly competitive with self-attention architectures while offering higher memory efficiency. Our research not only advances technical development for the revitalization of low-resource languages but also highlights the cultural gap in existing human evaluation protocols, calling for a more community-centered approach to human evaluation.