Abstract

In recent years, advances in discrete acoustic token modeling have yielded significant leaps in the autoregressive generation of speech. Meanwhile, non-autoregressive parallel iterative decoding has been developed for efficient image synthesis. Parallel iterative decoding promises faster inference than autoregressive methods and is better suited to tasks like infilling, which require conditioning on both past and future sequence elements. In this work, we combine parallel iterative decoding with acoustic token modeling and apply the combination to music audio synthesis. To the best of our knowledge, ours is the first extension of parallel iterative decoding to neural music audio generation. Our model can be flexibly applied to a variety of applications via token-based prompting: we guide generation with selectively masked music token sequences, asking the model to fill in the blanks. The outputs of this procedure range from high-quality audio compression to variations on the input music that match it in style, genre, beat, and instrumentation while varying the specifics of timbre and rhythm.
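To make the fill-in-the-blanks procedure concrete, the sketch below illustrates one common form of parallel iterative decoding (a MaskGIT-style confidence-based loop) over a sequence of discrete acoustic tokens. It is a minimal illustration, not this paper's exact implementation: the function name, the assumption that `model` maps a token sequence to per-position logits, and the cosine re-masking schedule are all illustrative choices.

```python
import math
import torch

@torch.no_grad()
def parallel_iterative_decode(model, tokens, mask, num_steps=12):
    """Sketch of confidence-based parallel iterative decoding.

    tokens: LongTensor (seq_len,) of acoustic token ids; positions where
            `mask` is True are unknown and will be filled in.
    model:  assumed to map a token sequence to logits (seq_len, vocab).
    """
    for step in range(num_steps):
        if not mask.any():
            break
        logits = model(tokens)                          # (seq_len, vocab)
        probs = torch.softmax(logits, dim=-1)
        sampled = torch.multinomial(probs, 1).squeeze(-1)
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
        # Commit sampled tokens at every currently masked position.
        tokens = torch.where(mask, sampled, tokens)
        # Cosine schedule: fewer positions stay masked each pass.
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_remask = int(frac * int(mask.sum()))
        if num_remask == 0:
            break
        # Prompt tokens are never re-masked (infinite confidence);
        # re-mask only the lowest-confidence predictions.
        conf = conf.masked_fill(~mask, float("inf"))
        remask_idx = conf.topk(num_remask, largest=False).indices
        mask = torch.zeros_like(mask)
        mask[remask_idx] = True
    return tokens
```

Because every prompt token (the unmasked positions) conditions every refinement pass, the same loop serves both prefix continuation and infilling: the caller chooses which positions to mask, and the model fills them in over a handful of parallel passes rather than one token at a time.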
