HELIX: Scaling Raw Audio Understanding with Hybrid Mamba-Attention Beyond the Quadratic Limit

📅 2026-03-22

🤖 AI Summary
Research on audio representation learning often evaluates design factors such as input frontends, backbone architectures, and sequence lengths in isolation, which yields conclusions that are difficult to generalize. This work proposes HELIX, a controlled framework that compares pure Mamba, pure attention, and a minimal hybrid Mamba-Attention architecture with a single attention bottleneck, all matched at about 8.3M parameters. The comparison shows that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. Across six datasets, the clearest case is a five-minute speaker identification task with 30,000 tokens: pure attention fails with out-of-memory errors, while the HELIX hybrid closes an 11.5-point gap over pure Mamba.

📝 Abstract
Audio representation learning typically evaluates design choices such as input frontend, sequence backbone, and sequence length in isolation. We show that these axes are coupled, and conclusions from one setting often do not transfer to others. We introduce HELIX, a controlled framework comparing pure Mamba, pure attention, and a minimal hybrid with a single attention bottleneck. All models are parameter-matched at about 8.3M parameters to isolate architectural effects. Across six datasets, we find that the preferred input representation depends on the backbone, and that attention hurts performance on short, stationary audio but becomes important at longer sequence lengths. On a 5-minute speaker identification task with 30,000 tokens, pure attention fails with out-of-memory errors, while HELIX closes an 11.5-point gap over pure Mamba.
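The out-of-memory result at 30,000 tokens follows from the quadratic growth of the attention score matrix. The sketch below is a back-of-the-envelope estimate, not the paper's measurement: it assumes a naive implementation that materializes the full score matrix in 32-bit floats, and the function names, head count, and batch size are hypothetical choices for illustration. The only figures taken from the abstract are the 5-minute duration and the 30,000-token length, which together imply 100 tokens per second.

```python
# Rough estimate of attention score-matrix memory versus sequence length.
# Assumptions (not from the paper): fp32 scores, fully materialized
# L x L matrix, hypothetical head count and batch size.

def tokens_per_second(num_tokens: int, duration_seconds: float) -> float:
    """Token rate implied by a sequence length and an audio duration."""
    return num_tokens / duration_seconds


def attention_score_bytes(
    seq_len: int,
    num_heads: int = 1,
    batch_size: int = 1,
    bytes_per_element: int = 4,  # 4 bytes = one 32-bit float
) -> int:
    """Bytes needed to hold one layer's seq_len x seq_len score matrices."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    # 5 minutes = 300 seconds, 30,000 tokens (both from the abstract).
    rate = tokens_per_second(30_000, 300)
    print(f"implied token rate: {rate:.0f} tokens/s")

    for seq_len in (1_000, 10_000, 30_000):
        gb = attention_score_bytes(seq_len) / 1e9
        print(f"L={seq_len:>6}: {gb:.3f} GB per head, per layer, per example")
```

Under these assumptions a single head on a single example already needs several gigabytes at 30,000 tokens, and the cost multiplies with heads, layers, and batch size. Memory-efficient attention kernels avoid storing the full matrix, so the real threshold depends on the implementation; the sketch only shows why the cost scales with the square of the sequence length, whereas a recurrent state-space backbone such as Mamba carries a state whose size does not depend on sequence length.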
Problem

Research questions and friction points this paper is trying to address.

audio representation learning
sequence modeling
long-context audio
architectural coupling
quadratic complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba
attention mechanism
hybrid architecture
long-sequence audio modeling
parameter-matched comparison