31 Jul 2024 | Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
Stability AI has released Stable Audio Open, an open-source text-to-audio model trained on Creative Commons-licensed audio and designed to generate high-quality stereo audio at 44.1 kHz. The architecture has three components: an autoencoder that compresses waveforms into a shorter latent sequence, a T5-based text encoder that embeds the prompt, and a diffusion transformer (DiT) that generates latents conditioned on those text embeddings and on timing information, after which the autoencoder's decoder reconstructs the waveform. The training data comes from Freesound and the Free Music Archive (FMA) and was curated to exclude copyrighted content.
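In practice, generation runs these stages end to end: the prompt is embedded with T5, the DiT denoises a latent sequence under that conditioning, and the autoencoder decodes the latents into a stereo waveform. Below is a minimal inference sketch assuming the stable-audio-tools package and the publicly released stabilityai/stable-audio-open-1.0 checkpoint; helper names such as get_pretrained_model and generate_diffusion_cond follow that package's documented usage and are not described in the paper itself.

```python
import torch
import torchaudio
from einops import rearrange
# Assumes the stable-audio-tools package is installed (pip install stable-audio-tools)
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the released checkpoint (autoencoder + T5 conditioner + DiT)
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]   # 44.1 kHz
sample_size = model_config["sample_size"]   # samples per generation window (~47 s)
model = model.to(device)

# Text prompt plus timing conditioning for variable-length generation
conditioning = [{
    "prompt": "128 BPM tech house drum loop",
    "seconds_start": 0,
    "seconds_total": 30,
}]

# Run the latent diffusion (DiT) and decode with the autoencoder
output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="dpmpp-3m-sde",
    device=device,
)

# (batch, channels, samples) -> stereo waveform, peak-normalized 16-bit WAV
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32)
output = output / output.abs().max()
torchaudio.save("output.wav", (output.clamp(-1, 1) * 32767).to(torch.int16).cpu(), sample_rate)
```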
Performance is evaluated with several metrics, including FD_openl3 (a Fréchet Distance computed on OpenL3 embeddings) to gauge the realism of generated audio, and the results show the model is competitive with state-of-the-art systems. The model also supports variable-length generation up to 47 seconds: it is trained to fill any unused duration with silence, so shorter outputs can simply be trimmed.
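Because shorter outputs are padded with silence up to the 47-second window, a small post-processing step can remove the silent tail before saving. A minimal sketch of such a step, where the energy threshold and frame size are illustrative choices rather than values from the paper:

```python
import torch

def trim_trailing_silence(audio: torch.Tensor, sample_rate: int,
                          threshold: float = 1e-3, frame_ms: float = 20.0) -> torch.Tensor:
    """Drop the silent tail of a (channels, samples) waveform.

    Scans fixed-size frames and cuts after the last frame whose peak
    amplitude exceeds `threshold`. Threshold and frame size are illustrative.
    """
    frame = max(1, int(sample_rate * frame_ms / 1000))
    peaks = audio.abs().max(dim=0).values            # per-sample peak across channels
    n_frames = peaks.shape[0] // frame
    if n_frames == 0:
        return audio
    framed = peaks[: n_frames * frame].reshape(n_frames, frame).max(dim=1).values
    active = (framed > threshold).nonzero()
    if active.numel() == 0:
        return audio[:, :0]                          # nothing above threshold
    end = int(active.max().item() + 1) * frame
    return audio[:, :end]

# Example: trim a generated stereo clip before writing it to disk
# trimmed = trim_trailing_silence(output, sample_rate)
```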
The autoencoder's reconstruction quality is also evaluated and compares favorably with other neural audio codecs. A memorization analysis finds no exact copies of training audio in the model's outputs. The model runs on consumer-grade GPUs, making it accessible for both academic and artistic use cases.
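A memorization check of this kind typically embeds both generated clips and training clips, flags suspiciously close nearest neighbors, and then audits the flagged pairs by ear. A minimal sketch of that idea, assuming precomputed audio embeddings; the embedding model and similarity threshold here are illustrative and not the paper's exact protocol:

```python
import torch

def flag_possible_memorization(gen_emb: torch.Tensor, train_emb: torch.Tensor,
                               threshold: float = 0.95):
    """Return (train_index, similarity) for each generated clip whose nearest
    training clip exceeds a cosine-similarity threshold.

    gen_emb:   (n_generated, dim) embeddings of generated audio
    train_emb: (n_train, dim) embeddings of training audio
    The threshold is illustrative; flagged pairs would still be checked manually.
    """
    gen = torch.nn.functional.normalize(gen_emb, dim=-1)
    train = torch.nn.functional.normalize(train_emb, dim=-1)
    sims = gen @ train.T                      # pairwise cosine similarities
    best_sim, best_idx = sims.max(dim=-1)     # nearest training clip per generation
    flagged = best_sim > threshold
    return [(int(i), float(s)) for i, s, f in zip(best_idx, best_sim, flagged) if f]
```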
The model has some limitations: it cannot generate intelligible speech or singing, and its music generation lags behind some state-of-the-art systems. Even so, it is a competitive open-source model for generating high-quality stereo audio, and its publicly released weights and code make it a valuable resource for the research community.