Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation

2 Jan 2024 | Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li
Auffusion is a text-to-audio (TTA) generation model that combines diffusion models and large language models (LLMs) to generate audio from natural language prompts. Inspired by state-of-the-art text-to-image (T2I) diffusion models, Auffusion adapts the T2I framework to TTA by integrating a powerful pretrained latent diffusion model (LDM), inheriting its generative strength and precise cross-modal alignment. With limited data and computational resources, Auffusion surpasses previous TTA approaches, generating audio that accurately matches textual descriptions. Comprehensive ablation studies and cross-attention map visualizations provide insightful assessments of text-audio alignment in TTA.

Although trained on far less data, Auffusion performs comparably to baseline models trained on much larger datasets, and it excels in text-audio alignment, as demonstrated in related tasks such as audio style transfer, inpainting, and other manipulations. The model features a carefully designed feature-space transformation pipeline that enables lossless audio conversion. The work also underscores the importance of the conditioning process: the text encoder is the critical bridge between text and audio, and different text encoders have markedly different impacts on cross-modal alignment. The findings reveal that the pretrained LDM adequately transfers cross-modal understanding from T2I to TTA tasks, resulting in better alignment.
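The feature-space transformation pipeline mentioned above converts a waveform into a log-mel spectrogram that is normalized into an image-like tensor, so that a pretrained T2I latent diffusion model can operate on it. The sketch below illustrates the general idea in plain numpy; the filterbank construction, FFT parameters, and [0, 1] normalization are illustrative assumptions, not Auffusion's exact pipeline:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    # Simplified triangular mel filters (librosa's version adds area normalization).
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)  # falling slope
    return fb

def audio_to_image(wave, sr=16000, n_fft=512, hop=128, n_mels=64):
    # Framed windowed FFT -> power spectrogram of shape (frames, n_fft//2 + 1).
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1)) ** 2
    logmel = np.log(mel_filterbank(n_mels, n_fft, sr) @ spec.T + 1e-5)
    # Min-max normalize so the result can be treated like a grayscale image.
    return (logmel - logmel.min()) / (logmel.max() - logmel.min() + 1e-8)

sr = 16000
t = np.arange(sr) / sr
img = audio_to_image(np.sin(2 * np.pi * 440 * t), sr=sr)  # 1 s, 440 Hz tone
```

Because the spectrogram is treated as an image, the inverse direction (image back to waveform) requires a vocoder, which is why a careful, invertible transformation matters for audio quality.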
Performance is evaluated with objective metrics such as Fréchet Distance (FD), Fréchet Audio Distance (FAD), Kullback–Leibler (KL) divergence, Inception Score (IS), and CLAP score, as well as subjective evaluations by human raters. Auffusion outperforms the baseline models in both objective and subjective evaluations, demonstrating a superior ability to generate audio that aligns with text descriptions. Performance is also measured against the number of events in the text, showing the model's ability to handle fine-grained text-audio alignment. Applications include audio style transfer, audio inpainting, and attention-based manipulations such as word swap and text-attention re-weighting. These results demonstrate the model's versatility and effectiveness in generating audio that accurately reflects the given captions.
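The Fréchet-style metrics above compare Gaussian statistics of embeddings computed from reference and generated audio: FD = ||mu1 - mu2||^2 + Tr(Sigma1 + Sigma2 - 2(Sigma1 Sigma2)^(1/2)). A minimal sketch of that formula, assuming diagonal covariances for simplicity (the real metrics use full covariances and a pretrained embedding network, e.g. VGGish embeddings for FAD):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    # Frechet distance between two Gaussians with diagonal covariances:
    # ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1 * var2))
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 8))  # stand-in "reference" embeddings
fake = rng.normal(0.5, 1.0, size=(1000, 8))  # stand-in "generated" embeddings
fd = frechet_distance_diag(real.mean(0), real.var(0), fake.mean(0), fake.var(0))
fd_self = frechet_distance_diag(real.mean(0), real.var(0), real.mean(0), real.var(0))
```

A distribution compared with itself scores (numerically) zero, and the score grows as the generated embeddings drift from the reference statistics, which is why lower FD/FAD indicates better generation quality.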
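The text-attention re-weighting listed among the applications can be illustrated as scaling the cross-attention weight assigned to a chosen prompt token and renormalizing, so that token exerts more (or less) influence on the generated audio. The shapes and function names below are illustrative, not Auffusion's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reweighted_cross_attention(Q, K, token_idx, scale):
    # Standard scaled dot-product attention over text tokens, with one
    # token's weight multiplied by `scale` and the rows re-normalized.
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (audio_positions, text_tokens)
    attn[:, token_idx] *= scale
    return attn / attn.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
Q = rng.normal(size=(16, 32))  # queries from 16 audio latent positions
K = rng.normal(size=(5, 32))   # keys from 5 text tokens
base = reweighted_cross_attention(Q, K, token_idx=2, scale=1.0)
boost = reweighted_cross_attention(Q, K, token_idx=2, scale=3.0)
```

Word swap works analogously: the key/value sequence for one token is replaced while the attention maps are reused, steering the output toward the new word without regenerating from scratch.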