STABLE AUDIO OPEN

ICML 2024 | Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons
The paper introduces Stable Audio Open, an open-weight text-to-audio model trained exclusively on Creative Commons (CC) data. The model is designed to be accessible to artists and researchers, addressing the lack of public weights and well-documented training data in existing models. Key features include:

- **Architecture**: The model consists of an autoencoder, a T5-based text embedding, and a transformer-based diffusion model (DiT). It can generate variable-length stereo audio at 44.1 kHz (a minimal inference sketch follows this list).
- **Training Data**: The dataset consists of CC-licensed recordings from Freesound and the Free Music Archive (FMA), ensuring no copyrighted content is used.
- **Evaluation**: The model performs competitively with state-of-the-art models across various metrics, particularly in high-quality stereo sound synthesis.
- **Autoencoder**: The autoencoder is trained to reconstruct audio waveforms, achieving performance close to that of Stable Audio 2.0 despite being trained only on CC data.
- **Inference Speed**: The model runs efficiently on consumer-grade GPUs, reaching roughly 8 diffusion steps/sec on an RTX 3090, 11 steps/sec on an RTX A6000, and 20 steps/sec on an H100.
- **Limitations**: The model struggles to generate intelligible speech and certain styles of music, in part because copyrighted content is excluded from its training data, and it requires prompt engineering for optimal results.

The paper also includes detailed evaluations, comparisons with baselines, and analyses of memorization and VRAM usage during inference.
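For orientation, here is a minimal inference sketch following the publicly documented `stable-audio-tools` usage for the released `stabilityai/stable-audio-open-1.0` checkpoint. The prompt, step count, and CFG scale are illustrative defaults, not values prescribed by the paper; check function names such as `generate_diffusion_cond` against the current library version.

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the open-weight checkpoint and its config from Hugging Face.
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
sample_rate = model_config["sample_rate"]  # 44100 Hz
sample_size = model_config["sample_size"]  # generation window length in samples

model = model.to(device)

# Text prompt plus timing conditioning for variable-length generation.
conditioning = [{
    "prompt": "128 BPM tech house drum loop",  # illustrative prompt
    "seconds_start": 0,
    "seconds_total": 30,
}]

# Run the DiT diffusion sampler in the autoencoder's latent space.
output = generate_diffusion_cond(
    model,
    steps=100,     # number of diffusion steps
    cfg_scale=7,   # classifier-free guidance strength
    conditioning=conditioning,
    sample_size=sample_size,
    device=device,
)

# Collapse the batch dimension into the channel axis: (b, d, n) -> (b*d, n).
output = rearrange(output, "b d n -> (b d) n")

# Peak-normalize, clip to [-1, 1], convert to 16-bit PCM, and save.
output = (
    output.to(torch.float32)
    .div(torch.max(torch.abs(output)))
    .clamp(-1, 1)
    .mul(32767)
    .to(torch.int16)
    .cpu()
)
torchaudio.save("output.wav", output, sample_rate)
```

At the reported speeds (roughly 8-20 steps/sec depending on GPU), a 100-step sample like the one above should take on the order of 5-12 seconds of compute.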