An Exploration of Mamba for Speech Self-Supervised Models

📅 2025-06-14
📈 Citations: 0
✨ Influential: 0
๐Ÿ“„ PDF
🤖 AI Summary
This work addresses efficiency and representational bottlenecks of Transformers in speech self-supervised learning (SSL), particularly for long-sequence modeling, streaming speech processing, and fine-grained unit extraction. We propose a Mamba-based HuBERT model that replaces the Transformer encoder with a Selective State Space Model (SSM). The model is comprehensively evaluated on streaming ASR fine-tuning and the SUPERB benchmark. Results demonstrate: (1) significantly reduced computational cost for long-context ASR fine-tuning; (2) superior streaming ASR performance over Transformer baselines; (3) enhanced accuracy on causal speech tasks, including automatic speech recognition and speaker verification; (4) more robust quantized representations; and (5) improved speaker feature disentanglement. To our knowledge, this is the first systematic exploration of Mamba architectures in speech SSL. Our findings position Mamba-based SSL as an efficient, low-latency direction for speech self-supervised learning.
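The selective SSM that replaces the Transformer encoder can be sketched as an input-dependent linear recurrence that runs in linear time over the sequence. The shapes, projections, and softplus discretization below are illustrative assumptions for exposition, not the paper's exact configuration:

```python
import numpy as np

def selective_ssm(x, A, W_delta, W_B, W_C):
    """Sequential selective scan (Mamba-style sketch): O(L) in length L.

    x: (L, D) input sequence; A: (D, N) diagonal state matrix (negative);
    W_delta, W_B, W_C: projections making the dynamics input-dependent.
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                            # fixed-size recurrent state
    ys = np.empty((L, D))
    for t in range(L):
        # Input-dependent ("selective") parameters at step t.
        delta = np.log1p(np.exp(x[t] @ W_delta))    # softplus step size, (D,)
        B = x[t] @ W_B                              # (N,)
        C = x[t] @ W_C                              # (N,)
        # Zero-order-hold discretization of the continuous system.
        A_bar = np.exp(delta[:, None] * A)          # (D, N), entries in (0, 1)
        B_bar = delta[:, None] * B[None, :]         # (D, N)
        h = A_bar * h + B_bar * x[t][:, None]       # state update
        ys[t] = h @ C                               # readout
    return ys
```

Because each output depends only on past inputs through the state `h`, the model is causal by construction, which is what makes it a natural fit for the streaming and causal-task evaluations described above.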

📝 Abstract
While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised learning (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time Selective State Space Model, these models enable fine-tuning on long-context ASR with significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models show competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction.
Problem

Research questions and friction points this paper is trying to address.

Exploring Mamba for speech self-supervised learning models
Comparing Mamba-based and Transformer-based SSL architectures
Evaluating performance on long-context and streaming ASR tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-based HuBERT for speech SSL
Linear-time Selective State Space Model (SSM)
Superior streaming ASR performance
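The streaming advantage listed above follows from the recurrence itself: a fixed-size state carried across chunks reproduces the full-sequence output exactly, giving O(1) memory per step. A minimal NumPy sketch with random stand-in parameters (not trained weights):

```python
import numpy as np

def ssm_chunk(x, A_bar, B_bar, C, h):
    """Process one chunk x of shape (L, D) from state h; return (ys, h)."""
    ys = np.empty_like(x)
    for t in range(x.shape[0]):
        h = A_bar * h + B_bar * x[t][:, None]   # diagonal linear recurrence
        ys[t] = h @ C                           # readout
    return ys, h

rng = np.random.default_rng(0)
L, D, N = 12, 4, 3
x = rng.standard_normal((L, D))
A_bar = rng.uniform(0.1, 0.9, (D, N))           # stable decay factors
B_bar = rng.standard_normal((D, N))
C = rng.standard_normal(N)

# Full-sequence pass vs. three streaming chunks with carried state.
y_full, _ = ssm_chunk(x, A_bar, B_bar, C, np.zeros((D, N)))
h = np.zeros((D, N))
chunks = []
for start in range(0, L, 4):
    y_c, h = ssm_chunk(x[start:start + 4], A_bar, B_bar, C, h)
    chunks.append(y_c)
y_stream = np.concatenate(chunks)
assert np.allclose(y_full, y_stream)            # chunked == full-sequence
```

This exact-equivalence property is what a Transformer with a growing attention cache lacks, and it is why the causal SSM suits low-latency streaming ASR.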