🤖 AI Summary
Existing self-supervised speech representation learning methods suffer from high computational overhead and limited generalization. To address these limitations, this work introduces an efficient, general-purpose speech foundation model. First, it adopts FastConformer, a lightweight backbone featuring 8× downsampling, in place of computationally intensive Transformer or Conformer encoders. Second, it replaces clustering-based quantization with a fixed random projection, eliminating the training instability inherent in learned discrete codebooks. Third, it applies a generalized noisy-speech augmentation that teaches the model to disentangle the target speaker from noise and interfering speakers. The model achieves new state-of-the-art performance across diverse downstream tasks, including automatic speech recognition, speech translation, speaker diarization, and spoken language understanding, outperforming widely used models such as Wav2Vec 2.0 and Whisper. The code and pre-trained weights are publicly available in NVIDIA NeMo.
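The noisy-speech augmentation described above amounts to mixing noise (or an interfering speaker's audio) into the target utterance at a chosen signal-to-noise ratio, so the model must learn to represent only the main speaker. A minimal sketch of SNR-based mixing is shown below; the function name, SNR range, and synthetic waveforms are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Additively mix a noise (or interfering-speaker) waveform into the
    target speech at a given signal-to-noise ratio in dB."""
    # Loop/trim the noise to match the speech length
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 s of synthetic 16 kHz "audio"
noise = rng.standard_normal(4000)
# Sample an SNR per utterance, e.g. uniformly from [0, 20] dB (assumed range)
noisy = mix_at_snr(speech, noise, snr_db=rng.uniform(0.0, 20.0))
```

During self-supervised training, the reconstruction or masked-prediction targets are still computed from the clean `speech`, while the encoder sees `noisy`, which is what forces the disentanglement.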
📝 Abstract
Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification, and diarization. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than the Transformer or Conformer architectures. Instead of clustering-based quantization, we use a fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that the proposed model improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition/translation, speaker diarization, and spoken language understanding. Code and checkpoints are publicly available via the NVIDIA NeMo framework.
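The fixed random projection replaces a learned quantizer: speech features are projected through a frozen random matrix and snapped to the nearest vector in a frozen random codebook, and the resulting index serves as the discrete prediction target. A minimal NumPy sketch follows; all shapes, the cosine-style normalization, and the nearest-neighbor search are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def random_projection_quantize(features, proj, codebook):
    """Map each feature frame to the index of the nearest codebook vector
    after a fixed (untrained) random projection."""
    projected = features @ proj  # (T, code_dim)
    # L2-normalize both sides so nearest-neighbor search uses angular distance
    projected = projected / np.linalg.norm(projected, axis=-1, keepdims=True)
    codes = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    dists = np.linalg.norm(projected[:, None, :] - codes[None, :, :], axis=-1)
    return dists.argmin(axis=-1)  # (T,) discrete pseudo-label targets

rng = np.random.default_rng(0)
T, feat_dim, code_dim, vocab = 50, 80, 16, 1024  # hypothetical sizes
proj = rng.standard_normal((feat_dim, code_dim))  # fixed, never trained
codebook = rng.standard_normal((vocab, code_dim))  # fixed, never trained
targets = random_projection_quantize(
    rng.standard_normal((T, feat_dim)), proj, codebook
)
print(targets.shape)  # (50,)
```

Because neither `proj` nor `codebook` receives gradients, there is no codebook-collapse or cluster-drift instability to manage; the targets are fixed for the whole training run.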