Fast Timing-Conditioned Latent Audio Diffusion

2024 | Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, Jordi Pons
Stable Audio is a latent diffusion model that generates long-form, variable-length stereo music and sound effects at 44.1kHz from text prompts. It addresses two challenges: the computational cost of generating long-form, full-band stereo content, and the naturally varying durations of music and sound effects. The model is built on a fully convolutional variational autoencoder (VAE) that encodes 44.1kHz stereo audio into a compact latent space, enabling efficient training and inference. The diffusion model is conditioned on both text prompts and timing embeddings, giving fine-grained control over the content and length of the generated output. It can render up to 95 seconds of stereo audio in 8 seconds on an A100 GPU, making it one of the fastest models in its class.

The paper also introduces evaluation metrics adapted to long-form, full-band stereo signals, which existing metrics do not typically address: a Fréchet Distance based on OpenL3 embeddings, a Kullback-Leibler divergence measuring semantic correspondence, and a CLAP score measuring adherence to the text prompt. Qualitative assessments cover audio quality, text alignment, musicality, stereo correctness, and musical structure.

Stable Audio outperforms state-of-the-art models in both quantitative and qualitative evaluations, demonstrating superior audio quality, text alignment, and the ability to generate structured music and stereo sound effects. It is also significantly faster than autoregressive models and other latent diffusion models, making it a practical option for generating high-quality, long-form audio content.
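The timing conditioning is what enables variable-length generation: the start time and total length of the audio are mapped to learned embeddings and passed to the diffusion backbone alongside the text features. Below is a minimal PyTorch sketch of this idea; the module structure, names (`TimingConditioner`, `start_embed`), and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Minimal sketch of timing conditioning: seconds_start and seconds_total
    are mapped to learned embeddings and concatenated with the text-prompt
    features, so the diffusion backbone sees them via cross-attention.
    Names, sizes, and the exact embedding scheme are illustrative assumptions."""

    def __init__(self, max_seconds: int = 512, embed_dim: int = 768):
        super().__init__()
        # One learned embedding per discrete second value.
        self.start_embed = nn.Embedding(max_seconds, embed_dim)
        self.total_embed = nn.Embedding(max_seconds, embed_dim)

    def forward(self, seconds_start: torch.Tensor, seconds_total: torch.Tensor,
                text_features: torch.Tensor) -> torch.Tensor:
        # seconds_*: (batch,) integer tensors; text_features: (batch, seq, dim)
        start = self.start_embed(seconds_start).unsqueeze(1)  # (batch, 1, dim)
        total = self.total_embed(seconds_total).unsqueeze(1)  # (batch, 1, dim)
        # Concatenate timing tokens with text tokens along the sequence axis.
        return torch.cat([text_features, start, total], dim=1)

cond = TimingConditioner()
text = torch.randn(2, 77, 768)         # stand-in for text-prompt features
start = torch.tensor([0, 10])          # chunk start time (seconds)
total = torch.tensor([95, 30])         # total audio length (seconds)
print(cond(start, total, text).shape)  # torch.Size([2, 79, 768])
```

At inference, the paper reports that requesting a total length shorter than the model's training window produces audio that ends in silence, which can then be trimmed to the requested duration.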
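For concreteness, the two embedding-space metrics can be sketched as follows, assuming per-clip embeddings have already been extracted (OpenL3 embeddings for the Fréchet Distance, CLAP text and audio embeddings for the adherence score). This is an illustrative reimplementation of the standard formulas, not the authors' evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_embs: np.ndarray, fake_embs: np.ndarray) -> float:
    """Fréchet Distance between Gaussians fit to two embedding sets
    (rows are per-clip OpenL3 embeddings); same formula as FID/FAD."""
    mu_r, mu_f = real_embs.mean(axis=0), fake_embs.mean(axis=0)
    cov_r = np.cov(real_embs, rowvar=False)
    cov_f = np.cov(fake_embs, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

def clap_score(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Cosine similarity between CLAP text and audio embeddings;
    higher means the audio adheres more closely to the prompt."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(t @ a)
```

A lower Fréchet Distance indicates generated audio whose embedding statistics are closer to the reference set, while a higher CLAP score indicates stronger text alignment; the paper's contribution is adapting these measures to long-form, full-band stereo signals.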