🤖 AI Summary
This work addresses two key challenges in autoregressive long-form music waveform generation: modeling long-range dependencies and capturing multi-scale temporal structure. The authors propose HarmonicRNN, a linear state-space model (a deep linear RNN) trained with a context-parallel mechanism. This design improves sequence-modeling efficiency and training stability, enabling autoregressive generation of waveforms up to one minute in duration (~1M tokens). HarmonicRNN achieves state-of-the-art (SOTA) log-likelihoods and perceptual quality metrics, including Fréchet Audio Distance, on small-scale datasets. Notably, it is the first linear RNN-based architecture empirically validated for high-fidelity long-audio modeling, demonstrating both effectiveness and scalability within the linear RNN framework.
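The summary does not specify HarmonicRNN's exact parameterization; as a point of reference, the core computation of a diagonal linear RNN (the simplest deep state-space layer) can be sketched as follows. All names, shapes, and the constant-coefficient form are illustrative, not taken from the paper:

```python
import numpy as np

def diagonal_linear_rnn(x, a, b):
    """Scan the diagonal linear recurrence h_t = a * h_{t-1} + b * x_t.

    x: (T, D) input sequence; a, b: (D,) per-channel coefficients
    (|a| < 1 keeps the state bounded over very long sequences).
    Returns the (T, D) sequence of hidden states.
    """
    h = np.zeros(x.shape[1])
    states = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]  # no nonlinearity inside the recurrence
        states[t] = h
    return states
```

Because there is no nonlinearity between time steps, the recurrence is associative, so it can be evaluated with a parallel scan instead of this sequential loop; that property is what makes training on million-token sequences tractable.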
📝 Abstract
Directly learning to generate audio waveforms in an autoregressive manner is a challenging task, due to the length of the raw sequences and the existence of important structure on many different timescales. Traditional approaches based on recurrent neural networks, as well as causal convolutions and self-attention, have only had limited success on this task. However, recent work has shown that deep state-space models, also referred to as linear RNNs, can be highly efficient in this context. In this work, we push the boundaries of linear RNNs applied to raw audio modeling, investigating the effects of different architectural choices and using context-parallelism to enable training on sequences up to one minute (1M tokens) in length. We present a model, HarmonicRNN, which attains state-of-the-art log-likelihoods and perceptual metrics on small-scale datasets.
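The context-parallelism mentioned in the abstract can be illustrated for a simplified constant-coefficient diagonal recurrence. The two-phase structure below (independent per-chunk scans, then a cheap boundary-correction pass) is a generic sketch of the idea, not the paper's actual implementation:

```python
import numpy as np

def scan_chunk(x, a, b, h0):
    """Sequentially scan h_t = a * h_{t-1} + b * x_t over one chunk from state h0."""
    states = np.empty_like(x)
    h = h0
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        states[t] = h
    return states, h

def context_parallel_scan(x, a, b, n_chunks):
    """Split the sequence into chunks (as if across devices) and scan them.

    Phase 1 (parallelizable): scan each chunk independently from a zero state.
    Phase 2 (cheap, O(n_chunks)): propagate the true boundary states. For a
    diagonal recurrence, the state t steps into a chunk that started from h0 is
    a**t * h0 plus the zero-initialized scan, so each chunk's local result can
    be corrected after the fact.
    """
    chunks = np.split(x, n_chunks)          # requires T divisible by n_chunks
    L = chunks[0].shape[0]
    zero = np.zeros(x.shape[1])
    local = [scan_chunk(c, a, b, zero) for c in chunks]
    out, h0 = [], zero
    for states_zero, final_zero in local:
        decay = a ** np.arange(1, L + 1)[:, None]  # a**t for each step in the chunk
        out.append(states_zero + decay * h0)       # add the carried-in state's contribution
        h0 = (a ** L) * h0 + final_zero            # true state at the chunk boundary
    return np.concatenate(out)
```

The chunked result matches a plain sequential scan exactly; the point is that Phase 1, which dominates the cost, needs no communication between chunks.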