🤖 AI Summary
Conventional target speech extraction (TSE) relies on high-quality pre-recorded enrollment utterances, which limits its applicability in natural human-machine interaction under real-world noisy conditions. To overcome this limitation, the paper proposes the Enroll-on-Wakeup (EoW) framework, which uses the wake-word segment, captured naturally during device activation, as the enrollment reference, thereby eliminating the need for pre-recorded audio. The paper systematically evaluates state-of-the-art discriminative and generative TSE models in this setting and investigates an LLM-based text-to-speech (TTS) system for augmenting the short, noisy wake-word enrollment. Experimental results show that current TSE models degrade in the EoW setting, but that TTS-based augmentation substantially improves perceptual speech quality, supporting the feasibility of the EoW paradigm in realistic scenarios. Speech recognition accuracy still lags and requires further improvement.
📝 Abstract
Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech and enables a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under diverse real-world acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models suffer performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps remain in speech recognition accuracy.