AI Summary
This work addresses efficiency and representational bottlenecks of Transformers in speech self-supervised learning (SSL), particularly for long-sequence modeling, streaming speech processing, and fine-grained unit extraction. We propose a Mamba-based HuBERT model, replacing the Transformer encoder with a Selective State Space Model (SSM). The model is comprehensively evaluated on streaming ASR fine-tuning and the SUPERB benchmark. Results demonstrate: (1) significantly reduced computational cost for long-context ASR fine-tuning; (2) superior streaming ASR performance over Transformer baselines; (3) enhanced accuracy on causal speech tasks, including automatic speech recognition and speaker verification; (4) more robust quantized representations; and (5) improved speaker feature disentanglement. To our knowledge, this is the first systematic exploration of Mamba architectures in speech SSL. Our findings establish a new paradigm for efficient, low-latency speech self-supervised learning.
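The linear-time, causal behavior that motivates the encoder swap above can be illustrated with a minimal selective-SSM recurrence. This is a didactic sketch, not the paper's implementation: the projection matrices (`W_dt`, `W_B`, `W_C`), the diagonal state matrix `A`, and the scalar-input simplification are all assumptions made for brevity; a real Mamba block uses per-channel states, a hardware-aware parallel scan, and gating.

```python
import numpy as np

def selective_ssm_scan(x, W_dt, W_B, W_C, A):
    """Toy single-channel selective SSM scan over a sequence x of shape (T, d).

    'Selective' means the step size dt and the B/C projections are computed
    from the current input frame, unlike a time-invariant SSM. The loop runs
    once per frame, so cost is O(T) in sequence length and the output at
    time t depends only on frames <= t (causal, hence streaming-friendly).
    """
    T, d = x.shape
    n = A.shape[0]              # state dimension
    h = np.zeros(n)             # hidden state carried across time
    ys = []
    for t in range(T):
        # Input-dependent (selective) parameters for this frame
        dt = np.log1p(np.exp(x[t] @ W_dt))   # softplus keeps step size > 0
        B = x[t] @ W_B                       # input projection, shape (n,)
        C = x[t] @ W_C                       # output projection, shape (n,)
        # Zero-order-hold discretization of dh/dt = A h + B u (diagonal A)
        h = np.exp(dt * A) * h + dt * B * x[t].mean()
        ys.append(C @ h)
    return np.array(ys)
```

Because the state `h` is a fixed-size summary of the past, memory does not grow with context length, which is the property the paper exploits for long-context fine-tuning.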
Abstract
While Mamba has demonstrated strong performance in language modeling, its potential for speech self-supervised learning (SSL) remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time complexity of Selective State Space Models, these models enable fine-tuning on long-context ASR at significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models achieve competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for long-sequence modeling, real-time speech modeling, and speech unit extraction.