AI Summary
This work addresses efficiency and representational bottlenecks of Transformers in speech self-supervised learning (SSL), particularly for long-sequence modeling, streaming speech processing, and fine-grained unit extraction. We propose a Mamba-based HuBERT model, replacing the Transformer encoder with a Selective State Space Model (SSM). The model is comprehensively evaluated on streaming ASR fine-tuning and the SUPERB benchmark. Results demonstrate: (1) significantly reduced computational cost for long-context ASR fine-tuning; (2) superior streaming ASR performance over Transformer baselines; (3) enhanced accuracy on causal speech tasks, including automatic speech recognition and speaker verification; (4) more robust quantized representations; and (5) improved speaker feature disentanglement. To our knowledge, this is the first systematic exploration of Mamba architectures in speech SSL. Our findings establish a new paradigm for efficient, low-latency speech self-supervised learning.
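The linear-time, causal behavior that motivates the encoder swap above can be illustrated with a minimal selective-SSM recurrence. This is a didactic sketch, not the paper's implementation: the projection matrices (`W_dt`, `W_B`, `W_C`), the diagonal state matrix `A`, and the scalar-input simplification are all assumptions made for brevity; a real Mamba block uses per-channel states, a hardware-aware parallel scan, and gating.

```python
import numpy as np

def selective_ssm_scan(x, W_dt, W_B, W_C, A):
    """Toy single-channel selective SSM scan over a sequence x of shape (T, d).

    'Selective' means the step size dt and the B/C projections are computed
    from the current input frame, unlike a time-invariant SSM. The loop runs
    once per frame, so cost is O(T) in sequence length and the output at
    time t depends only on frames <= t (causal, hence streaming-friendly).
    """
    T, d = x.shape
    n = A.shape[0]              # state dimension
    h = np.zeros(n)             # hidden state carried across time
    ys = []
    for t in range(T):
        # Input-dependent (selective) parameters for this frame
        dt = np.log1p(np.exp(x[t] @ W_dt))   # softplus keeps step size > 0
        B = x[t] @ W_B                       # input projection, shape (n,)
        C = x[t] @ W_C                       # output projection, shape (n,)
        # Zero-order-hold discretization of dh/dt = A h + B u (diagonal A)
        h = np.exp(dt * A) * h + dt * B * x[t].mean()
        ys.append(C @ h)
    return np.array(ys)
```

Because the state `h` is a fixed-size summary of the past, memory does not grow with context length, which is the property the paper exploits for long-context fine-tuning.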
Abstract
While Mamba has demonstrated strong performance in language modeling, its potential for speech self-supervised learning (SSL) remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time complexity of Selective State Space Models, these models enable fine-tuning on long-context ASR at significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models achieve competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction.