🤖 AI Summary
To address the tension between the low-latency requirements of streaming speech interaction and the full-utterance assumption of existing self-supervised models, this paper proposes a unified chunk-based self-supervised learning framework. Methodologically, it introduces: (1) a high-resolution finite scalar quantization (FSQ) module that builds a discrete codebook with a vocabulary of several million entries; (2) chunk-level masked prediction with preceding-chunk dependency modeling and a copy-and-append data augmentation scheme; and (3) a group masked prediction loss that keeps the memory and compute cost of the large codebook manageable while jointly optimizing streaming and offline speech representations. The approach reduces computational overhead while improving knowledge transfer from pre-training to downstream tasks. Evaluated on LibriSpeech (ASR) and MuST-C (ST), the model achieves competitive performance in both streaming and offline settings, jointly targeting low latency and high accuracy in self-supervised speech representation learning.
📝 Abstract
Low-latency human-machine speech communication has become increasingly important as speech technology has advanced rapidly over the last decade. One of the primary drivers of this advancement is self-supervised learning. Most self-supervised learning algorithms are designed under a full-utterance assumption, and compromises must be made when only partial utterances are available, as is common in streaming applications. In this work, we propose a chunk-based self-supervised learning (Chunk SSL) algorithm as a unified solution for both streaming and offline speech pre-training. Chunk SSL is optimized with a masked prediction loss: an acoustic encoder is encouraged to restore the indices of masked speech frames using unmasked frames in the same chunk and in preceding chunks. A copy-and-append data augmentation approach is proposed for efficient chunk-based pre-training. Chunk SSL uses a finite scalar quantization (FSQ) module to discretize input speech features, and our study shows that a high-resolution FSQ codebook, i.e., a codebook with a vocabulary size of up to a few million, is beneficial for transferring knowledge from the pre-training task to downstream tasks. A group masked prediction loss is employed during pre-training to alleviate the high memory and computation cost introduced by the large codebook. The proposed approach is examined on two speech-to-text tasks, speech recognition and speech translation. Experimental results on the LibriSpeech and MuST-C datasets show that the proposed method achieves very competitive results for speech-to-text tasks in both streaming and offline modes.
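To make the "high-resolution FSQ codebook" idea concrete, the sketch below shows how finite scalar quantization can reach a vocabulary of a few million without storing an explicit codebook: each latent dimension is bounded and rounded to a small number of levels, and the codebook size is the product of the per-dimension level counts. The level configuration and function names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative per-dimension level counts; 8**5 * 5**3 = 4,096,000 codes,
# i.e. a vocabulary of "a few million" as described in the abstract.
LEVELS = [8, 8, 8, 8, 8, 5, 5, 5]

def fsq_quantize(z, levels=LEVELS):
    """Snap a latent vector z to the nearest FSQ grid point in [-1, 1]^d."""
    z = np.tanh(np.asarray(z, dtype=float))       # bound each dim to (-1, 1)
    half = (np.asarray(levels) - 1) / 2.0
    return np.round(z * half) / half              # round to the per-dim grid

def fsq_index(q, levels=LEVELS):
    """Map a quantized vector to a single integer codebook index (mixed radix)."""
    half = (np.asarray(levels) - 1) / 2.0
    digits = np.round(q * half + half).astype(int)  # per-dim digit in [0, L-1]
    idx = 0
    for d, L in zip(digits, levels):
        idx = idx * L + d                          # mixed-radix accumulation
    return int(idx)
```

Because the "codebook" is just a per-dimension rounding grid, increasing resolution only changes the level counts; no million-row embedding table has to be searched at quantization time.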
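The streaming constraint described above, where masked frames are restored from unmasked frames in the same chunk and in preceding chunks, can be expressed as a chunk-wise attention mask. The sketch below is our own minimal illustration of that constraint, not the paper's implementation.

```python
import numpy as np

def chunk_attention_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """mask[i, j] is True iff frame i may attend to frame j.

    A frame sees every frame in its own chunk (including future frames
    inside the chunk) and all frames in preceding chunks, but nothing
    from later chunks -- the chunk-based streaming constraint.
    """
    chunk_ids = np.arange(num_frames) // chunk_size
    return chunk_ids[:, None] >= chunk_ids[None, :]
```

For example, with `chunk_size=2` and 6 frames, frame 3 (in chunk 1) can attend to frames 0 through 3 but not to frames 4 and 5; within chunk 0, frame 0 can also attend forward to frame 1.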