🤖 AI Summary
To address the significant performance degradation of pretrained ASR systems in cross-channel scenarios, this paper proposes a channel-aware, domain-adaptive speech generation method. A channel encoder explicitly models target-domain acoustic characteristics, and a conditional GAN uses its embeddings to disentangle speech content from channel features and re-synthesize them. The approach generates high-fidelity, channel-matched training samples using only a small amount of unlabeled target-domain speech, so no target-domain transcriptions are needed. Notably, this is the first work to deeply integrate explicit channel embedding modeling with GAN-based speech synthesis. On the Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, the method achieves relative character error rate (CER) reductions of 20.02% and 9.64%, respectively, over the baselines, effectively narrowing the acoustic distribution gap between source and target domains.
📝 Abstract
While pre-trained automatic speech recognition (ASR) systems perform well on matched domains, their accuracy often degrades under channel mismatch caused by unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. Our method combines channel embedding extraction with generative adversarial networks (GANs). We first train a channel encoder capable of extracting embeddings from arbitrary audio. We then extract channel embeddings from a minimal amount of target-domain data and use them to guide a GAN-based speech synthesizer. This synthesizer generates speech that faithfully preserves the phonetic content of the input while mimicking the channel characteristics of the target domain. We evaluate our method on the challenging Hakka Across Taiwan (HAT) and Taiwanese Across Taiwan (TAT) corpora, achieving relative character error rate (CER) reductions of 20.02% and 9.64%, respectively, compared to the baselines. These results highlight the efficacy of our channel-aware data simulation method for bridging the gap between source- and target-domain acoustics.
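The reported gains are relative CER reductions, which are easy to confuse with absolute ones. The sketch below shows how such a figure is commonly computed, assuming CER is defined as character-level edit distance divided by reference length. The function names and the example strings are illustrative assumptions, not taken from the paper, and the paper's exact scoring setup may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction of a new system with respect to a baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical numbers for illustration only (not results from the paper):
# a drop from 25% to 20% CER is 5 points absolute but 20% relative.
baseline = cer("abcd", "abed")          # one substitution over four characters
improved = 0.20
print(f"baseline CER = {baseline:.2f}")
print(f"relative reduction = {relative_reduction(baseline, improved):.2%}")
```

Under this definition, a relative reduction of 20.02% means the adapted system's CER is roughly four fifths of the baseline's, whatever the baseline's absolute value.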