Linear RNNs for autoregressive generation of long music samples
By: Konrad Szewczyk, Daniel Gallo Fernández, James Townsend
Potential Business Impact:
Enables computers to generate realistic, minute-long music directly as raw audio.
Directly learning to generate audio waveforms in an autoregressive manner is a challenging task, due to the length of the raw sequences and the presence of important structure on many different timescales. Traditional approaches based on recurrent neural networks, as well as causal convolutions and self-attention, have had only limited success on this task. However, recent work has shown that deep state space models, also referred to as linear RNNs, can be highly efficient in this context. In this work, we push the boundaries of linear RNNs applied to raw audio modeling, investigating the effects of different architectural choices and using context-parallelism to enable training on sequences up to one minute (1M tokens) in length. We present a model, HarmonicRNN, which attains state-of-the-art log-likelihoods and perceptual metrics on small-scale datasets.
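To make the core mechanism concrete, below is a minimal sketch of a diagonal linear RNN layer whose recurrence h_t = a * h_{t-1} + B x_t is evaluated in parallel with an associative scan, the standard device that lets linear RNNs handle very long audio sequences. This is not the authors' HarmonicRNN and is not taken from the paper; all parameter names, shapes, and initializations are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def linear_rnn(x, log_a, B, C):
    """Diagonal linear RNN: x is (T, d_in), returns (T, d_out)."""
    a = jnp.exp(log_a)                    # (d_state,) decay coefficients in (0, 1]
    u = x @ B.T                           # (T, d_state) input projection
    a_seq = jnp.broadcast_to(a, u.shape)  # per-step transition coefficients

    def combine(left, right):
        # Compose two linear recurrence steps: (a1, b1) then (a2, b2).
        a_l, b_l = left
        a_r, b_r = right
        return a_l * a_r, a_r * b_l + b_r

    # Parallel evaluation of h_t = a * h_{t-1} + u_t over the whole sequence.
    _, h = jax.lax.associative_scan(combine, (a_seq, u))
    return h @ C.T                        # (T, d_out) readout

# Tiny usage example with random, illustrative parameters.
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
T, d_in, d_state, d_out = 1024, 1, 64, 1
x = jax.random.normal(k1, (T, d_in))
log_a = -0.1 * jnp.abs(jax.random.normal(k2, (d_state,)))  # keeps a stable
B = 0.1 * jax.random.normal(k3, (d_state, d_in))
C = 0.1 * jax.random.normal(k4, (d_out, d_state))
y = linear_rnn(x, log_a, B, C)            # (1024, 1)
```

The context-parallelism mentioned in the abstract would additionally split such a scan across devices; that sharding is omitted from this sketch.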
Similar Papers
Time-Varying Audio Effect Modeling by End-to-End Adversarial Training
Sound
Models time-varying audio effects end-to-end, without needing access to their hidden control signals.
Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention
CV and Pattern Recognition
Combines global memory with local attention so models can generate longer videos.
VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory
CV and Pattern Recognition
Creates longer, smoother, and more varied videos.