🤖 AI Summary
Existing self-supervised speech representation learning methods suffer from high computational overhead and limited generalization. To address these limitations, this work introduces an efficient, general-purpose speech foundation model. First, it adopts FastConformer, a lightweight backbone featuring 8× downsampling, in place of computationally intensive Transformer or Conformer encoders. Second, it replaces clustering-based quantization with a fixed random projection, eliminating the training instability inherent in learned discrete codebooks. Third, it applies a generalized noisy-speech augmentation that teaches the model to disentangle the target speaker from noise and interfering speakers. The model achieves new state-of-the-art performance across diverse downstream tasks, including automatic speech recognition, speech translation, speaker diarization, and spoken language understanding, outperforming widely used models such as Wav2Vec 2.0 and Whisper. The code and pre-trained weights are publicly available in NVIDIA NeMo.
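The noisy-speech augmentation described above amounts to mixing noise (or an interfering speaker's audio) into the target utterance at a chosen signal-to-noise ratio, so the model must learn to represent only the main speaker. A minimal sketch of SNR-based mixing is shown below; the function name, SNR range, and synthetic waveforms are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Additively mix a noise (or interfering-speaker) waveform into the
    target speech at a given signal-to-noise ratio in dB."""
    # Loop/trim the noise to match the speech length
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of synthetic 16 kHz "audio"
noise = rng.standard_normal(4000)
# Sample an SNR per utterance, e.g. uniformly from [0, 20] dB (assumed range)
noisy = mix_at_snr(speech, noise, snr_db=rng.uniform(0.0, 20.0))
```

During self-supervised training, the reconstruction or masked-prediction targets are still computed from the clean `speech`, while the encoder sees `noisy`, which is what forces the disentanglement.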
📝 Abstract
Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification, and diarization. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than the Transformer or Conformer architectures. Instead of clustering-based quantization, we use a fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that the proposed model improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition/translation, speaker diarization, and spoken language understanding. Code and checkpoints are publicly available via the NVIDIA NeMo framework.
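The fixed random projection replaces a learned quantizer: speech features are projected through a frozen random matrix and snapped to the nearest vector in a frozen random codebook, and the resulting index serves as the discrete prediction target. A minimal NumPy sketch follows; all shapes, the cosine-style normalization, and the nearest-neighbor search are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def random_projection_quantize(features, proj, codebook):
    """Map each feature frame to the index of the nearest codebook vector
    after a fixed (untrained) random projection."""
    projected = features @ proj  # (T, code_dim)
    # L2-normalize both sides so nearest-neighbor search uses angular distance
    projected = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    codes = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    dists = np.linalg.norm(projected[:, None, :] - codes[None, :, :], axis=-1)
    return dists.argmin(axis=-1)  # (T,) discrete pseudo-label targets

rng = np.random.default_rng(0)
T, feat_dim, code_dim, vocab = 50, 80, 16, 1024  # hypothetical sizes
proj = rng.standard_normal((feat_dim, code_dim))  # fixed, never trained
codebook = rng.standard_normal((vocab, code_dim))  # fixed, never trained
targets = random_projection_quantize(
    rng.standard_normal((T, feat_dim)), proj, codebook
)
print(targets.shape)  # (50,)
```

Because neither `proj` nor `codebook` receives gradients, there is no codebook-collapse or cluster-drift instability to manage; the targets are fixed for the whole training run.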