PatchDSU: Uncertainty Modeling for Out of Distribution Generalization in Keyword Spotting

📅 2025-08-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address degraded out-of-domain generalization in keyword spotting caused by distribution mismatch between training and test data, this paper proposes a spectrogram patch-based uncertainty modeling method. It partitions the input spectrogram into local patches, models the statistical properties of each patch individually, and represents domain shift uncertainty (DSU) via multivariate Gaussian distributions. Crucially, DSU-augmented features are substituted for the original patch features across multiple layers of a deep neural network to enhance robustness. This approach mitigates the skewed feature statistics induced by the temporal sparsity of speech. Experiments on Google Speech Commands, LibriSpeech, and TED-LIUM show that the method outperforms competing approaches in most cases under white Gaussian noise and MUSAN music corruption, with more consistent improvements in cross-domain generalization across the evaluated scenarios.

📝 Abstract
Deep learning models excel at many tasks but rely on the assumption that training and test data follow the same distribution. This assumption often does not hold in real-world speech systems, where distribution shifts are common due to varying environments, recording conditions, and speaker diversity. The method of Domain Shifts with Uncertainty (DSU) augments the input of each neural network layer based on the input feature statistics. It addresses the problem of out-of-domain generalization by assuming feature statistics follow a multivariate Gaussian distribution and substitutes the input with sampled features from this distribution. While effective for computer vision, applying DSU to speech presents challenges due to the nature of the data. Unlike static visual data, speech is a temporal signal commonly represented by a spectrogram - the change of frequency over time. This representation cannot be treated as a simple image, and the resulting sparsity can lead to skewed feature statistics when computed over the entire input. To tackle out-of-distribution issues in keyword spotting, we propose PatchDSU, which extends DSU by splitting the input into patches and independently augmenting each patch. We evaluated PatchDSU and DSU alongside other methods on the Google Speech Commands, LibriSpeech, and TED-LIUM datasets. Additionally, we evaluated performance under white Gaussian and MUSAN music noise conditions. We also explored out-of-domain generalization by analyzing model performance on datasets they were not trained on. Overall, in most cases, both PatchDSU and DSU outperform other methods. Notably, PatchDSU demonstrates more consistent improvements across the evaluated scenarios compared to other approaches.
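The DSU-style augmentation described in the abstract can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the (batch, channel, frequency, time) tensor layout and the batch-level estimate of statistic uncertainty are assumptions based on the DSU formulation summarized above.

```python
import numpy as np

def dsu_augment(x, rng, eps=1e-6):
    """Illustrative DSU-style augmentation of a (batch, channel, freq, time) feature map.

    Channel-wise mean/std are treated as random variables whose variance is
    estimated across the batch; perturbed statistics are sampled from the
    assumed Gaussian and used to re-normalize the features.
    """
    # Per-example, per-channel statistics over the freq/time axes.
    mu = x.mean(axis=(2, 3), keepdims=True)                   # (B, C, 1, 1)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)  # (B, C, 1, 1)
    # Uncertainty of those statistics, estimated over the batch dimension.
    sigma_mu = np.sqrt(mu.var(axis=0, keepdims=True) + eps)
    sigma_sigma = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)
    # Sample shifted statistics from the assumed Gaussian.
    mu_new = mu + rng.standard_normal(mu.shape) * sigma_mu
    sigma_new = sigma + rng.standard_normal(sigma.shape) * sigma_sigma
    # Normalize with the original statistics, then re-style with the sampled ones.
    return sigma_new * (x - mu) / sigma + mu_new
```

The substituted features keep the shape of the input, so the augmentation can be dropped into intermediate layers without architectural changes.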
Problem

Research questions and friction points this paper is trying to address.

Addresses out-of-distribution generalization in keyword spotting
Handles the temporal, sparse nature of speech data, which differs from static visual data
Improves robustness against varying environments and speaker diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

PatchDSU splits input into patches for augmentation
Models feature uncertainty via multivariate Gaussian distribution
Independently augments each patch to improve generalization
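The patch-wise idea in the bullets above can be sketched roughly as below. This is an illustrative sketch under stated assumptions (patch grid size, tensor layout, and per-patch statistic handling are choices made here for clarity, not the authors' code).

```python
import numpy as np

def patch_dsu(x, rng, grid=(2, 2), eps=1e-6):
    """Illustrative PatchDSU-style augmentation: split a (batch, channel, freq, time)
    spectrogram feature map into a grid of patches and augment each independently."""
    B, C, F, T = x.shape
    nf, nt = grid
    out = x.copy()
    for i in range(nf):
        for j in range(nt):
            fs = slice(i * F // nf, (i + 1) * F // nf)
            ts = slice(j * T // nt, (j + 1) * T // nt)
            p = x[:, :, fs, ts]
            # Per-patch statistics avoid the skew that whole-input sparsity
            # introduces when statistics are computed over the full spectrogram.
            mu = p.mean(axis=(2, 3), keepdims=True)
            sigma = np.sqrt(p.var(axis=(2, 3), keepdims=True) + eps)
            sigma_mu = np.sqrt(mu.var(axis=0, keepdims=True) + eps)
            sigma_sigma = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)
            mu_new = mu + rng.standard_normal(mu.shape) * sigma_mu
            sigma_new = sigma + rng.standard_normal(sigma.shape) * sigma_sigma
            out[:, :, fs, ts] = sigma_new * (p - mu) / sigma + mu_new
    return out
```

Because each patch is normalized and re-styled with its own sampled statistics, a mostly silent patch no longer distorts the statistics of the speech-active regions.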
Bronya Roni Chernyak
Faculty of Electrical and Computer Engineering, Technion–Israel Institute of Technology, Israel
Yael Segal
Faculty of Electrical and Computer Engineering, Technion–Israel Institute of Technology, Israel
Yosi Shrem
Faculty of Electrical and Computer Engineering, Technion–Israel Institute of Technology, Israel
Joseph Keshet
Professor, Faculty of Electrical & Computer Engineering, Technion
Machine Learning · Speech and Language Processing · Spoken Language Processing