🤖 AI Summary
This work addresses two key challenges in autoregressive long-form music waveform generation: modeling long-range dependencies and capturing multi-scale temporal structure. The authors propose HarmonicRNN, a linear state-space model (a deep linear RNN) trained with a context-parallel mechanism. This design improves sequence-modeling efficiency and training stability, enabling autoregressive generation of waveforms up to one minute in duration (~1M tokens). HarmonicRNN achieves state-of-the-art (SOTA) log-likelihoods and perceptual quality metrics, including Fréchet Audio Distance, on small-scale datasets. Notably, it is the first linear RNN-based architecture empirically validated for high-fidelity long-audio modeling, demonstrating both effectiveness and scalability within the linear RNN framework.
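The summary does not specify HarmonicRNN's exact parameterization; as a point of reference, the core computation of a diagonal linear RNN (the simplest deep state-space layer) can be sketched as follows. All names, shapes, and the constant-coefficient form are illustrative, not taken from the paper:

```python
import numpy as np

def diagonal_linear_rnn(x, a, b):
    """Scan the diagonal linear recurrence h_t = a * h_{t-1} + b * x_t.

    x: (T, D) input sequence; a, b: (D,) per-channel coefficients
    (|a| < 1 keeps the state bounded over very long sequences).
    Returns the (T, D) sequence of hidden states.
    """
    h = np.zeros(x.shape[1])
    states = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]  # no nonlinearity inside the recurrence
        states[t] = h
    return states
```

Because there is no nonlinearity between time steps, the recurrence is associative, so it can be evaluated with a parallel scan instead of this sequential loop; that property is what makes training on million-token sequences tractable.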
📝 Abstract
Directly learning to generate audio waveforms in an autoregressive manner is a challenging task, due to the length of the raw sequences and the existence of important structure on many different timescales. Traditional approaches based on recurrent neural networks, as well as causal convolutions and self-attention, have only had limited success on this task. However, recent work has shown that deep state-space models, also referred to as linear RNNs, can be highly efficient in this context. In this work, we push the boundaries of linear RNNs applied to raw audio modeling, investigating the effects of different architectural choices and using context-parallelism to enable training on sequences up to one minute (1M tokens) in length. We present a model, HarmonicRNN, which attains state-of-the-art log-likelihoods and perceptual metrics on small-scale datasets.
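The context-parallelism mentioned in the abstract can be illustrated for a simplified constant-coefficient diagonal recurrence. The two-phase structure below (independent per-chunk scans, then a cheap boundary-correction pass) is a generic sketch of the idea, not the paper's actual implementation:

```python
import numpy as np

def scan_chunk(x, a, b, h0):
    """Sequentially scan h_t = a * h_{t-1} + b * x_t over one chunk from state h0."""
    states = np.empty_like(x)
    h = h0
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        states[t] = h
    return states, h

def context_parallel_scan(x, a, b, n_chunks):
    """Split the sequence into chunks (as if across devices) and scan them.

    Phase 1 (parallelizable): scan each chunk independently from a zero state.
    Phase 2 (cheap, O(n_chunks)): propagate the true boundary states. For a
    diagonal recurrence, the state t steps into a chunk that started from h0 is
    a**t * h0 plus the zero-initialized scan, so each chunk's local result can
    be corrected after the fact.
    """
    chunks = np.split(x, n_chunks)          # requires T divisible by n_chunks
    L = chunks[0].shape[0]
    zero = np.zeros(x.shape[1])
    local = [scan_chunk(c, a, b, zero) for c in chunks]
    out, h0 = [], zero
    for states_zero, final_zero in local:
        decay = a ** np.arange(1, L + 1)[:, None]  # a**t for each step in the chunk
        out.append(states_zero + decay * h0)       # add the carried-in state's contribution
        h0 = (a ** L) * h0 + final_zero            # true state at the chunk boundary
    return np.concatenate(out)
```

The chunked result matches a plain sequential scan exactly; the point is that Phase 1, which dominates the cost, needs no communication between chunks.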