DM-Codec: Distilling Multimodal Representations for Speech Tokenization

📅 2024-10-19
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
To address the performance limitations in speech discretization that arise when acoustic, semantic, and contextual representations are modeled in isolation—hurting both reconstruction fidelity and linguistic understanding—this paper proposes DM-Codec, a multimodal distillation speech tokenizer. The method incorporates a language model into the speech tokenization framework, enabling LM-guided contextual distillation as well as a joint multimodal distillation mechanism driven by both the language model and a self-supervised speech model, thereby unifying acoustic, semantic, and contextual representation learning. Built on a streamlined encoder-decoder architecture with a residual vector quantizer (RVQ), DM-Codec is trained end-to-end in a fully differentiable manner. On LibriSpeech, it achieves state-of-the-art performance: up to a 13.46% relative reduction in word error rate (WER), a 9.82% reduction in word information lost (WIL), and improvements of 5.84% in speech quality and 1.85% in intelligibility.

📝 Abstract
Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model checkpoints are available at https://github.com/mubtasimahasan/DM-Codec.
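The combined LM- and SM-guided distillation described above can be pictured as a weighted teacher-matching loss on frame-level features. The sketch below is a minimal illustration of that idea under assumed choices (cosine distance, fixed weights `w_lm`/`w_sm`); it is not the paper's exact objective, and the function name is hypothetical.

```python
import numpy as np

def multimodal_distill_loss(student, lm_repr, sm_repr, w_lm=0.5, w_sm=0.5):
    """Weighted distillation loss pulling the tokenizer's features toward
    both an LM teacher (contextual) and an SM teacher (semantic).

    All arrays are (T, dim) frame-level representations; the teachers are
    assumed already projected to the student's dimension.
    """
    def cos_dist(a, b):
        # mean (1 - cosine similarity) across the T frames
        num = np.sum(a * b, axis=1)
        den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
        return float(np.mean(1.0 - num / den))

    return w_lm * cos_dist(student, lm_repr) + w_sm * cos_dist(student, sm_repr)
```

Setting `w_sm = 0` recovers the LM-only variant (approach 1 in the abstract), so both proposed distillation methods fit the same template.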
Problem

Research questions and friction points this paper is trying to address.

Mapping multidimensional speech attributes into discrete tokens
Integrating contextual representations to improve speech modeling
Distilling multimodal representations for comprehensive speech tokenization
Innovation

Methods, ideas, or system contributions that make the work stand out.

LM-guided distillation for contextual speech representation
Combined LM and SM distillation for multimodal tokens
Streamlined encoder-decoder with RVQ for speech tokenization
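The RVQ component named in the last bullet works by quantizing an encoder vector in stages, where each codebook encodes the residual left by the previous one. A minimal numpy sketch of generic residual vector quantization (not DM-Codec's implementation; the helper name and codebook format are illustrative):

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Quantize x with a stack of codebooks, each refining the residual.

    x: (dim,) encoder output vector.
    codebooks: list of (num_codes, dim) arrays, one per RVQ stage.
    Returns the per-stage code indices and the summed quantized vector.
    """
    residual = x.astype(float)
    indices = []
    quantized = np.zeros_like(residual)
    for cb in codebooks:
        # pick the codeword nearest to what is still unexplained
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized += cb[idx]
        residual -= cb[idx]  # next stage quantizes the leftover error
    return indices, quantized
```

Each added stage can only shrink (or keep) the reconstruction error, which is why stacking a few small codebooks approximates the encoder output much better than a single codebook of the same total size.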