π€ AI Summary
To address the lack of speech large language models (SLMs) tailored to Taiwanese Mandarin, this paper introduces the first end-to-end real-time spoken dialogue model specifically designed for Taiwanese Guoyu. Methodologically, we adopt a decoder-only Transformer architecture, construct a high-quality synthetic spoken dialogue dataset, design a low-latency full-duplex speech interaction mechanism, and establish a multi-turn coherence evaluation framework. Our key contributions are: (1) the first full-duplex speech interaction modeling for Taiwanese Guoyu; and (2) a novel training paradigm and evaluation protocol explicitly designed for real-time spoken dialogue. Experimental results demonstrate that the prototype system supports natural, fluent multi-turn voice conversations, achieving sub-300ms response latency and validated semantic coherence. This work provides a reproducible technical pathway for developing dialectal Chinese speech LLMs.
π Abstract
This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex capabilities allowing simultaneous speaking and listening. The paper also details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction. We also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues. We hope the release of the report can contribute to the future development of spoken LLMs in Taiwanese Mandarin.