Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of conventional target speech extraction (TSE): it relies on high-quality pre-recorded enrollment utterances, which restricts its use in natural human-machine interaction under real-world noisy conditions. The paper proposes the Enroll-on-Wakeup (EoW) framework, which uses the wake-word segment, captured naturally during device activation, as the enrollment reference, eliminating the need for pre-recorded audio. The study evaluates advanced discriminative and generative extraction models and investigates an LLM-based text-to-speech (TTS) system for augmenting the short, noisy wake-word enrollment. Results show that current TSE models degrade in the EoW setting, that TTS-based assistance significantly improves the listening experience, and that gaps remain in speech recognition accuracy.

📝 Abstract
Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech to enable a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under real diverse acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps remain in speech recognition accuracy.
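
The enrollment idea in the abstract can be illustrated with a minimal sketch. It assumes that a wake-word detector (not shown) has already supplied the wake-word boundaries in seconds, and that `tse_model` is a hypothetical callable taking a mixture and an enrollment waveform. The function names and signatures are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def split_wakeup_and_query(audio, sample_rate, wake_start_s, wake_end_s):
    """Split one recording into (enrollment, mixture).

    Assumption: wake_start_s and wake_end_s come from an external
    wake-word detector and are given in seconds.
    """
    start = int(round(wake_start_s * sample_rate))
    end = int(round(wake_end_s * sample_rate))
    if not 0 <= start < end <= len(audio):
        raise ValueError("wake-word boundaries fall outside the recording")
    enrollment = audio[start:end]  # short, possibly noisy wake-word segment
    mixture = audio[end:]          # speech that follows the wake word
    return enrollment, mixture


def enroll_on_wakeup(audio, sample_rate, wake_start_s, wake_end_s, tse_model):
    """Extract target speech using the wake-word segment as enrollment.

    tse_model is a hypothetical callable: (mixture, enrollment) -> waveform.
    """
    enrollment, mixture = split_wakeup_and_query(
        audio, sample_rate, wake_start_s, wake_end_s
    )
    return tse_model(mixture, enrollment)


if __name__ == "__main__":
    sr = 16000
    recording = np.zeros(3 * sr, dtype=np.float32)  # 3 s of placeholder audio

    # Placeholder model that returns the mixture unchanged; a real TSE
    # model would condition on the enrollment to isolate the target speaker.
    def passthrough(mixture, enrollment):
        return mixture

    extracted = enroll_on_wakeup(recording, sr, 0.5, 1.5, passthrough)
    print(extracted.shape)
```

No pre-recorded enrollment appears anywhere in this flow: the only speaker reference is the segment cut from the same recording. This is why its short duration and noise level matter for extraction quality.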
Problem

Research questions and friction points this paper is trying to address.

Target speech extraction
Enroll-on-Wakeup
Wake-word
Human-machine dialogue
Noisy environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enroll-on-Wakeup
Target speech extraction
Wake-word enrollment
LLM-based TTS
Seamless human-machine interaction
Yiming Yang
Shanghai Normal University, Shanghai, China
Guangyong Wang
Unisound AI Technology Co., Ltd., Beijing, China
Haixin Guan
Unisound AI Technology Co., Ltd., Beijing, China
Yanhua Long
Professor, Shanghai Normal University
Speech signal processing