NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks

📅 2024-08-23
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing self-supervised speech representation learning methods suffer from high computational overhead and limited generalization capability. To address these limitations, this work introduces an efficient and general-purpose speech foundation model. First, we adopt FastConformer—a lightweight backbone architecture featuring 8× downsampling—in place of computationally intensive Transformer or Conformer encoders. Second, we replace clustering-based quantization with fixed random projection to eliminate the training instability inherent in discrete codebook learning. Third, we design a generalized noise augmentation strategy that teaches the model to disentangle the main speaker from noise and interfering speakers, promoting robust representation learning. Our model achieves new state-of-the-art performance across diverse downstream tasks—including automatic speech recognition, speech translation, speaker diarization, and spoken language understanding—outperforming mainstream speech foundation models such as Wav2Vec 2.0 and Whisper. The code and pre-trained weights are publicly available in NVIDIA NeMo.

📝 Abstract
Self-supervised learning has proven beneficial to a wide range of speech processing tasks, such as speech recognition/translation, speaker verification, and diarization. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that the model improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition/translation, speaker diarization, and spoken language understanding. Code and checkpoints are publicly available via the NVIDIA NeMo framework.
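The fixed random-projection quantization mentioned in the abstract can be sketched as follows. This is a minimal illustrative example in the spirit of such quantizers (a fixed, untrained projection maps each frame into a low-dimensional space, and the training target is the index of the nearest entry in a fixed random codebook); all names, dimensions, and the cosine-similarity lookup are assumptions for illustration, not NEST's actual implementation.

```python
import numpy as np

def make_quantizer(feat_dim=80, proj_dim=16, codebook_size=8192, seed=0):
    """Build a fixed (untrained) random-projection quantizer.

    Hypothetical sketch: dimensions and similarity metric are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Both the projection matrix and the codebook are sampled once and frozen;
    # no gradients ever flow through them, which avoids codebook collapse.
    projection = rng.normal(size=(feat_dim, proj_dim))
    codebook = rng.normal(size=(codebook_size, proj_dim))
    # L2-normalize codebook entries so nearest-neighbor search reduces to
    # a cosine-similarity argmax.
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

    def quantize(features):
        # features: (time, feat_dim) -> discrete targets: (time,)
        projected = features @ projection
        projected /= np.linalg.norm(projected, axis=1, keepdims=True)
        return np.argmax(projected @ codebook.T, axis=1)

    return quantize

# Usage: turn 100 frames of 80-dim features into discrete training targets.
quantize = make_quantizer()
tokens = quantize(np.random.default_rng(1).normal(size=(100, 80)))
print(tokens.shape)  # (100,)
```

Because the projection and codebook are deterministic given the seed, the same audio always yields the same target tokens, so no auxiliary codebook-learning loss is needed during pre-training.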
Problem

Research questions and friction points this paper is trying to address.

Self-supervised Learning
Speech Processing
Computational Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

NEST
self-supervised learning
FastConformer
He Huang
NVIDIA, Santa Clara, CA, USA
T. Park
NVIDIA, Santa Clara, CA, USA
Kunal Dhawan
Research Scientist, NVIDIA
Machine Learning, Deep Learning, Speech Processing, Natural Language Processing, Multimodal ML
I. Medennikov
NVIDIA, Santa Clara, CA, USA
Krishna C. Puvvada
NVIDIA
Artificial Intelligence, Human-Machine Interfaces
N. Koluguri
NVIDIA, Santa Clara, CA, USA
Weiqing Wang
NVIDIA, Santa Clara, CA, USA
Jagadeesh Balam
NVIDIA, Santa Clara, CA, USA
Boris Ginsburg
NVIDIA
Deep Learning, Speech Recognition, Speech Synthesis