🤖 AI Summary
Low-quality speech synthesis and poor cultural adaptation hinder multilingual education for children in low-resource language settings. Method: We propose the first child-centered multilingual text-to-speech (TTS) framework supporting under-resourced languages, including Singaporean Mandarin, Malay, and Tamil, by integrating large language models (LLMs) with multilingual TTS. To enhance cultural relevance, we introduce a culturally aware image captioning task to guide content generation; we further incorporate age-appropriate linguistic modeling and a two-part evaluation protocol combining objective metrics (e.g., MOS, WER) with child-in-the-loop subjective feedback. Contribution/Results: Experiments demonstrate significant improvements over baselines in speech naturalness, cultural appropriateness, and child comprehension. The framework effectively boosts learning engagement and second-language acquisition outcomes in real-world educational contexts.
📝 Abstract
Generative speech models have demonstrated significant potential for personalizing teacher-student interactions, with valuable real-world applications to language learning in children's education. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse linguistic and cultural contexts. In this paper, we propose MultiAiTutor, an educational multilingual generative AI tutor with a child-friendly design that leverages an LLM architecture for speech generation tailored to educational purposes. MultiAiTutor integrates age-appropriate multilingual speech generation and facilitates young children's language learning through culturally relevant image-description tasks in three low-resource languages: Singaporean-accented Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate that MultiAiTutor outperforms baseline methods.