🤖 AI Summary
This work addresses the performance limitations of discrete speech tokens derived from self-supervised learning (SSL) models: quantization discards information, which degrades downstream performance relative to continuous SSL features. The authors propose a new paradigm that keeps hard discretization during training, preserving its efficiency, and switches to soft token assignment (a probability distribution over tokens) only at downstream inference, making the tokens more expressive. Experiments show that the method outperforms conventional hard assignment on automatic speech recognition (ASR) and speech synthesis, with particularly strong generalization to out-of-domain data. For ASR of non-native speech, it even surpasses models based on continuous SSL features.
📝 Abstract
Discrete speech tokens obtained from self-supervised learning (SSL) models provide efficient data compression while maintaining strong performance, and have been widely used as intermediate representations in various tasks. However, discretization inevitably causes information loss, leading to degraded performance compared with continuous SSL features. In this work, we propose to apply soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both ASR and speech synthesis tasks, and exhibits particularly strong generalizability to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows they align more accurately with phonemes compared with conventional hard assignment.
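The contrast between hard and soft assignment can be illustrated with a minimal sketch. The abstract does not specify how tokens or soft weights are computed, so everything below is an assumption made for illustration: tokens are taken to be nearest k-means centroids of SSL features, soft weights are a softmax over negative squared distances with a hypothetical `temperature` parameter, and the downstream model is assumed to consume tokens through an embedding table. All function names are hypothetical.

```python
import numpy as np


def squared_distances(features, centroids):
    # features: (T, D) frame-level SSL features; centroids: (K, D) codebook.
    # Returns a (T, K) matrix of squared Euclidean distances.
    diff = features[:, None, :] - centroids[None, :, :]
    return (diff ** 2).sum(axis=-1)


def hard_assign(features, centroids):
    # Hard assignment: each frame becomes the index of its nearest centroid.
    return squared_distances(features, centroids).argmin(axis=1)


def soft_assign(features, centroids, temperature=1.0):
    # Soft assignment (assumed form): softmax over negative squared distances.
    # Returns a (T, K) matrix whose rows are probability distributions.
    logits = -squared_distances(features, centroids) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


def embed_hard(token_ids, embedding):
    # Training-time input (assumed): plain lookup in a (K, E) embedding table.
    return embedding[token_ids]


def embed_soft(probs, embedding):
    # Inference-time input (assumed): probability-weighted mix of embeddings.
    return probs @ embedding
```

Under these assumptions, the embedding table is trained with hard tokens only and is reused unchanged at inference. As `temperature` approaches zero, the soft weights concentrate on the nearest centroid, so `embed_soft` reduces to `embed_hard`.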