2024 | Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, Jordi Pons
Stable Audio is a latent diffusion model that generates long-form, variable-length stereo audio at 44.1 kHz from text prompts. A variational autoencoder (VAE) compresses the audio into a compact latent space, enabling faster training and inference, and the diffusion model is conditioned on text and timing embeddings, allowing precise control over both the content and the length of the generated audio. It can generate up to 95 seconds of stereo audio in 8 seconds on an A100 GPU, and it outperforms other models on two public benchmarks, generating structured music and stereo sound effects that state-of-the-art models cannot.

To evaluate long-form, full-band stereo audio, the work introduces quantitative metrics (a Fréchet Distance based on OpenL3 embeddings, a Kullback-Leibler divergence, and a CLAP score) and conducts a qualitative study assessing audio quality, text alignment, musicality, stereo correctness, and musical structure.

The model is trained on a large dataset of music, sound effects, and instrument stems, and uses a CLAP-based text encoder. Compared against AudioLDM2, MusicGen, and AudioGen, it proves competitive in audio quality, text alignment, musicality, and stereo generation, while remaining efficient, fast, and capable of variable-length generation, making it well suited to music and sound-effect generation. The research highlights the importance of timing conditioning and the need for new metrics for evaluating long-form stereo audio generation.
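The timing conditioning can be pictured as two learned lookup tables: one embedding for where a training crop starts within the source audio, and one for the total length of that audio. At inference, requesting `(0, N)` asks the model for N seconds of audio from the beginning. The sketch below is a minimal illustration with random tables standing in for learned embeddings; the names and shapes are assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, MAX_SECONDS = 768, 95  # hypothetical conditioning dim; 95 s max length

# Learned lookup tables in the real model; random here for illustration.
start_table = rng.standard_normal((MAX_SECONDS + 1, DIM))
total_table = rng.standard_normal((MAX_SECONDS + 1, DIM))

def timing_tokens(seconds_start: int, seconds_total: int) -> np.ndarray:
    """Two timing tokens, concatenated with the text tokens, that tell the
    model where a crop starts and how long the full audio is. At inference,
    (0, N) requests N seconds of audio starting from t=0."""
    return np.stack([start_table[seconds_start], total_table[seconds_total]])

tokens = timing_tokens(0, 30)  # request 30 s starting at t=0
print(tokens.shape)            # (2, 768)
```

Training on crops labeled this way is what lets the model produce variable-length output with proper endings rather than audio that cuts off arbitrarily.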
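The Fréchet Distance metric compares the distribution of OpenL3 embeddings of generated audio against that of reference audio, with each set summarized by its mean and covariance. Below is a minimal pure-NumPy sketch of the distance itself, assuming the embeddings have already been extracted (the OpenL3 step is omitted):

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FD between two Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})."""
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^{1/2}) equals the sum of square roots of the
    # eigenvalues of sigma1 @ sigma2 (real and non-negative for PSD
    # covariances); clip tiny negatives from numerical error.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2).real
    tr_covmean = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * tr_covmean)

# Toy 2-d example: unit-covariance Gaussians whose means differ by 1.
fd = frechet_distance(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), np.eye(2))
print(fd)  # 1.0
```

A lower FD indicates that the generated audio is statistically closer to the reference set in the embedding space.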
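The CLAP score measures text-audio alignment as the cosine similarity between the CLAP embedding of the prompt and the CLAP embedding of the generated audio, averaged over a benchmark set. A sketch of the similarity itself, with made-up 4-d vectors standing in for real CLAP encoder outputs:

```python
import numpy as np

def clap_score(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Cosine similarity between a text embedding and the embedding of the
    audio generated from that text; higher means better alignment."""
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(t @ a)

# Toy vectors only; real scores use the pretrained CLAP text/audio encoders.
score = clap_score(np.array([1.0, 0.0, 0.0, 0.0]),
                   np.array([1.0, 1.0, 0.0, 0.0]))
print(score)  # ≈ 0.707
```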