Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

2024 | Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria
This paper introduces Tango 2, a text-to-audio generation model that improves alignment between generated audio and input prompts through direct preference optimization (DPO) on a preference dataset called Audio-alpaca. Tango 2 is obtained by fine-tuning the diffusion-based Tango model on this dataset, with a DPO loss applied to the diffusion objective to align the generated audio with human preferences. Each entry in Audio-alpaca pairs a text prompt with a preferred (winner) audio output and an undesirable (loser) output; the losers are constructed to miss concepts from the prompt or to present events in an incorrect temporal order.

The key contributions are: (1) a cost-effective method for semi-automatically creating a preference dataset for text-to-audio generation; (2) the release of the Audio-alpaca dataset for future research; (3) a demonstration that Tango 2 outperforms existing models such as Tango and AudioLDM 2 in both objective and subjective evaluations; and (4) evidence that diffusion-DPO is applicable to audio generation.
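To make the preference objective concrete, below is a minimal sketch of a DPO-style loss computed on diffusion denoising errors, in the spirit of the diffusion-DPO formulation the paper builds on. The tensor names, the helper signature, and the beta value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_theta_l, eps_ref_w, eps_ref_l,
                       noise_w, noise_l, beta=2000.0):
    """DPO-style preference loss on diffusion denoising errors.

    eps_theta_*: noise predicted by the model being fine-tuned on the
    noised winner/loser latents; eps_ref_*: predictions of the frozen
    reference model; noise_*: the true noise added at the sampled
    timestep. All tensors share a batch dimension.
    """
    def err(pred, target):
        # Per-sample squared error, summed over all non-batch dims.
        return ((pred - target) ** 2).flatten(1).sum(dim=1)

    # How much better (lower error) the fine-tuned model is than the
    # frozen reference on the winner and on the loser, respectively.
    diff_w = err(eps_theta_w, noise_w) - err(eps_ref_w, noise_w)
    diff_l = err(eps_theta_l, noise_l) - err(eps_ref_l, noise_l)

    # Reward out-denoising the reference on winners while falling
    # behind it on losers.
    return -F.logsigmoid(-beta * (diff_w - diff_l)).mean()
```

Because the reference model's predictions enter only through the error differences, the frozen copy anchors the fine-tuned model and keeps it from drifting far from its pre-trained behavior while it learns the preference ordering.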
The paper also surveys related work in text-to-audio generation, including the diffusion-based AudioLDM, Make-An-Audio, and Tango, as well as the autoregressive AudioGen, and argues that aligning generated audio with human preferences has been largely absent from this line of work. Audio-alpaca is built using three strategies: generating multiple audio samples from the same prompt, generating audio from perturbed prompts, and generating audio from temporally perturbed prompts. Candidate pairs are then filtered by CLAP score so that each preferred sample is strongly aligned with its text prompt while the undesirable sample remains semantically close but not too similar to the preferred one; a sketch of this filtering step follows below. Training with the DPO loss lets the model learn from both the preferred and the undesirable outputs. In the reported results, Tango 2 outperforms previous models on objective metrics (FAD, KL divergence, IS, and CLAP score) as well as on subjective ratings of overall audio quality and relevance to the input text. The paper further highlights the effectiveness of temporal data augmentation in improving performance on prompts with temporal structure.
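The CLAP-based filtering step described above could look roughly like the following sketch. Here clap_score is a hypothetical helper standing in for a real CLAP checkpoint (e.g., LAION-CLAP), and the threshold values are placeholders rather than the paper's actual settings.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    winner_audio: str  # path to the preferred generation
    loser_audio: str   # path to the undesirable generation

def clap_score(prompt: str, audio_path: str) -> float:
    """Hypothetical helper: cosine similarity between the CLAP text
    embedding of `prompt` and the CLAP audio embedding of the file at
    `audio_path`. A real implementation would load a CLAP checkpoint."""
    raise NotImplementedError

def filter_pairs(candidates, min_winner=0.45, min_gap=0.05):
    """Keep (prompt, winner, loser) triples whose winner is strongly
    aligned with the prompt and whose loser trails it by a clear
    margin. Thresholds are illustrative placeholders."""
    kept = []
    for prompt, winner, loser in candidates:
        s_w = clap_score(prompt, winner)
        s_l = clap_score(prompt, loser)
        if s_w >= min_winner and (s_w - s_l) >= min_gap:
            kept.append(PreferencePair(prompt, winner, loser))
    return kept
```

The gap condition is what keeps the loser "semantically close but not too similar": it must score below the winner by a clear margin, yet pairs built from the same or perturbed prompts naturally stay in the winner's semantic neighborhood.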