🤖 AI Summary
To address the tension between the low-latency requirements of streaming speech interaction and the full-utterance assumption of existing self-supervised models, this paper proposes a unified chunk-based self-supervised learning framework. Methodologically, it introduces: (1) a high-resolution finite scalar quantization (FSQ) module that builds a discrete codebook with a vocabulary of several million entries; (2) chunk-level masked prediction with preceding-chunk dependency modeling and a copy-and-append data augmentation scheme; and (3) a group masked prediction loss that keeps the memory and compute cost of the large codebook manageable while jointly optimizing streaming and offline speech representations. The approach reduces computational overhead while improving knowledge transfer from pre-training to downstream tasks. Evaluated on LibriSpeech (ASR) and MuST-C (ST), the model achieves competitive performance in both streaming and offline settings, jointly targeting low latency and high accuracy in self-supervised speech representation learning.
📝 Abstract
Low-latency human-machine speech communication has become increasingly important as speech technology has advanced rapidly over the last decade. One of the primary drivers of this advancement is self-supervised learning. Most self-supervised learning algorithms are designed under a full-utterance assumption, and compromises must be made when only partial utterances are available, as is common in streaming applications. In this work, we propose a chunk-based self-supervised learning (Chunk SSL) algorithm as a unified solution for both streaming and offline speech pre-training. Chunk SSL is optimized with a masked prediction loss: an acoustic encoder is encouraged to restore the indices of masked speech frames using unmasked frames in the same chunk and in preceding chunks. A copy-and-append data augmentation approach is proposed for efficient chunk-based pre-training. Chunk SSL uses a finite scalar quantization (FSQ) module to discretize input speech features, and our study shows that a high-resolution FSQ codebook, i.e., a codebook with a vocabulary size of up to a few million, is beneficial for transferring knowledge from the pre-training task to downstream tasks. A group masked prediction loss is employed during pre-training to alleviate the high memory and computation cost introduced by the large codebook. The proposed approach is examined on two speech-to-text tasks, speech recognition and speech translation. Experimental results on the LibriSpeech and MuST-C datasets show that the proposed method achieves very competitive results for speech-to-text tasks in both streaming and offline modes.
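To make the "high-resolution FSQ codebook" idea concrete, the sketch below shows how finite scalar quantization can reach a vocabulary of a few million without storing an explicit codebook: each latent dimension is bounded and rounded to a small number of levels, and the codebook size is the product of the per-dimension level counts. The level configuration and function names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative per-dimension level counts; 8**5 * 5**3 = 4,096,000 codes,
# i.e. a vocabulary of "a few million" as described in the abstract.
LEVELS = [8, 8, 8, 8, 8, 5, 5, 5]

def fsq_quantize(z, levels=LEVELS):
    """Snap a latent vector z to the nearest FSQ grid point in [-1, 1]^d."""
    z = np.tanh(np.asarray(z, dtype=float))       # bound each dim to (-1, 1)
    half = (np.asarray(levels) - 1) / 2.0
    return np.round(z * half) / half              # round to the per-dim grid

def fsq_index(q, levels=LEVELS):
    """Map a quantized vector to a single integer codebook index (mixed radix)."""
    half = (np.asarray(levels) - 1) / 2.0
    digits = np.round(q * half + half).astype(int)  # per-dim digit in [0, L-1]
    idx = 0
    for d, L in zip(digits, levels):
        idx = idx * L + d                          # mixed-radix accumulation
    return int(idx)
```

Because the "codebook" is just a per-dimension rounding grid, increasing resolution only changes the level counts; no million-row embedding table has to be searched at quantization time.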
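The streaming constraint described above, where masked frames are restored from unmasked frames in the same chunk and in preceding chunks, can be expressed as a chunk-wise attention mask. The sketch below is our own minimal illustration of that constraint, not the paper's implementation.

```python
import numpy as np

def chunk_attention_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """mask[i, j] is True iff frame i may attend to frame j.

    A frame sees every frame in its own chunk (including future frames
    inside the chunk) and all frames in preceding chunks, but nothing
    from later chunks -- the chunk-based streaming constraint.
    """
    chunk_ids = np.arange(num_frames) // chunk_size
    return chunk_ids[:, None] >= chunk_ids[None, :]
```

For example, with `chunk_size=2` and 6 frames, frame 3 (in chunk 1) can attend to frames 0 through 3 but not to frames 4 and 5; within chunk 0, frame 0 can also attend forward to frame 1.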