Enroll-on-Wakeup: A First Comparative Study of Target Speech Extraction for Seamless Interaction in Real Noisy Human-Machine Dialogue Scenarios

📅 2026-02-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Conventional target speech extraction (TSE) depends on high-quality pre-recorded enrollment utterances, which limits its use in natural human-machine interaction under real-world noisy conditions. To remove this dependency, the paper proposes the Enroll-on-Wakeup (EoW) framework, which uses the wake-word segment captured naturally during device activation as the enrollment reference, so no pre-recorded audio is needed. The study evaluates state-of-the-art discriminative and generative extraction models in this setting and uses a large language model (LLM)-based text-to-speech (TTS) system to augment the short, noisy wake-word enrollment. Results show that current TSE models degrade under EoW conditions, that TTS-based augmentation significantly improves the listening experience, and that a gap in speech recognition accuracy remains.

📝 Abstract

Target speech extraction (TSE) typically relies on pre-recorded high-quality enrollment speech, which disrupts user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework where the wake-word segment, captured naturally during human-machine interaction, is automatically utilized as the enrollment reference. This eliminates the need for pre-collected speech and enables a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under diverse real-world acoustic conditions. Given the short and noisy nature of wake-word segments, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models face performance degradation in EoW-TSE, TTS-based assistance significantly enhances the listening experience, though gaps remain in speech recognition accuracy.
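
As a rough illustration of the enrollment idea only, the sketch below cuts one recording into a wake-word segment (reused as the enrollment reference) and the audio that follows it (the mixture a TSE model would process). It is not the authors' implementation: the names `EowInput` and `split_on_wakeup` are hypothetical, the wake-word boundaries are assumed to come from an upstream keyword-spotting module, and neither the extraction models nor the TTS augmentation are modeled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EowInput:
    """Hypothetical container for the two parts of one interaction."""
    enrollment: List[float]  # wake-word segment, reused as the enrollment reference
    mixture: List[float]     # audio after the wake word, to be passed to a TSE model


def split_on_wakeup(samples: Sequence[float], sample_rate: int,
                    wake_start_s: float, wake_end_s: float) -> EowInput:
    """Cut one recording into an enrollment segment and the following mixture.

    Assumption: wake_start_s and wake_end_s (in seconds) are supplied by an
    upstream keyword-spotting step that is not part of this sketch.
    """
    if not 0 <= wake_start_s < wake_end_s:
        raise ValueError("wake-word boundaries must satisfy 0 <= start < end")
    start = int(round(wake_start_s * sample_rate))
    end = min(int(round(wake_end_s * sample_rate)), len(samples))
    if start >= end:
        raise ValueError("wake-word segment lies outside the recording")
    return EowInput(enrollment=list(samples[start:end]),
                    mixture=list(samples[end:]))


# Usage: a 3 s recording at 16 kHz with the wake word spoken from 0.5 s to 1.5 s.
audio = [0.0] * (16000 * 3)
parts = split_on_wakeup(audio, 16000, 0.5, 1.5)
```

In this usage example the enrollment is only one second long, which is the short-enrollment condition the abstract identifies as the reason for investigating TTS-based augmentation.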

Problem

Research questions and friction points this paper is trying to address.

Target speech extraction

Enroll-on-Wakeup

Wake-word

Human-machine dialogue

Noisy environments

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enroll-on-Wakeup

Target speech extraction

Wake-word enrollment

LLM-based TTS

Seamless human-machine interaction

🔎 Similar Papers

Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking

2024-09-10 | arXiv.org | Citations: 1