🤖 AI Summary
Dialogue State Tracking (DST) in speech-driven task-oriented dialogues is highly vulnerable to Automatic Speech Recognition (ASR) errors—especially misrecognitions of named entities—leading to substantial performance degradation. To address this, we propose a keyword-aware controllable error augmentation method: first, leveraging prompt-based learning to identify critical slot-value positions; second, constructing a phoneme-similarity confusion model to inject semantically plausible and acoustically similar synthetic errors *only* at the identified positions, thereby generating high-quality noisy training data; and third, performing end-to-end fine-tuning to enhance DST robustness. This is the first approach to jointly integrate keyword localization with phoneme-level modeling for controllable, targeted error generation. Empirical results demonstrate significant improvements in DST accuracy across diverse ASR noise types, with particularly pronounced gains under extreme noise conditions—e.g., when ASR word accuracy falls below 80%.
📝 Abstract
Dialogue State Tracking (DST) is a key part of task-oriented dialogue systems, identifying important information in conversations. However, its accuracy drops significantly in spoken dialogue environments due to named entity errors from Automatic Speech Recognition (ASR) systems. We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of DST models. Our method controls the placement of errors using keyword-highlighted prompts while introducing phonetically similar errors. As a result, it generates sufficient error patterns on keywords, leading to improved accuracy in noisy and low-accuracy ASR environments.
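The core idea above — inject acoustically plausible errors *only* at identified slot-value keywords, leaving the rest of the utterance untouched — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the confusion table is hand-written here, whereas the paper derives confusable forms from a phoneme-similarity model, and the keyword set would come from the prompt-based localization step.

```python
import random

# Hypothetical phoneme-similarity confusion table: each keyword maps to
# acoustically similar misrecognitions an ASR system might plausibly emit.
# In the described method, these candidates would come from a phoneme-level
# confusion model rather than a hand-written dictionary.
CONFUSIONS = {
    "cambridge": ["camebridge", "came bridge"],
    "gonville": ["gonvile", "gone ville"],
    "saturday": ["saterday"],
}

def inject_keyword_errors(utterance, keywords, error_rate=1.0, rng=None):
    """Replace only the identified slot-value keywords with phonetically
    similar errors; all non-keyword tokens pass through unchanged."""
    rng = rng or random.Random(0)
    out = []
    for token in utterance.split():
        key = token.lower()
        if key in keywords and key in CONFUSIONS and rng.random() < error_rate:
            out.append(rng.choice(CONFUSIONS[key]))  # sample a confusable form
        else:
            out.append(token)
    return " ".join(out)

noisy = inject_keyword_errors(
    "i need a hotel in cambridge on saturday",
    keywords={"cambridge", "saturday"},
)
```

Pairing such synthetically noised utterances with the original (clean) dialogue-state labels yields the targeted training data used for the robustness fine-tuning stage.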