Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

2024 | Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria
This paper introduces Tango 2, a text-to-audio generation model that improves alignment between generated audio and input prompts through direct preference optimization (DPO) on a preference dataset called Audio-alpaca. Tango 2 is obtained by fine-tuning the diffusion-based Tango model on this dataset, with a DPO loss applied to the diffusion objective to align the generated audio with human preferences. Each entry in Audio-alpaca pairs a text prompt with a preferred (winner) audio output and an undesirable (loser) output; the losers are constructed to miss concepts from the prompt or to present events in an incorrect temporal order.

The key contributions are: (1) a cost-effective method for semi-automatically creating a preference dataset for text-to-audio generation; (2) the release of the Audio-alpaca dataset for future research; (3) a demonstration that Tango 2 outperforms existing models such as Tango and AudioLDM 2 in both objective and subjective evaluations; and (4) evidence that diffusion-DPO is applicable to audio generation.
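To make the preference objective concrete, below is a minimal sketch of a DPO-style loss computed on diffusion denoising errors, in the spirit of the diffusion-DPO formulation the paper builds on. The tensor names, the helper signature, and the beta value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_theta_w, eps_theta_l, eps_ref_w, eps_ref_l,
                       noise_w, noise_l, beta=2000.0):
    """DPO-style preference loss on diffusion denoising errors.

    eps_theta_*: noise predicted by the model being fine-tuned on the
    noised winner/loser latents; eps_ref_*: predictions of the frozen
    reference model; noise_*: the true noise added at the sampled
    timestep. All tensors share a batch dimension.
    """
    def err(pred, target):
        # Per-sample squared error, summed over all non-batch dims.
        return ((pred - target) ** 2).flatten(1).sum(dim=1)

    # How much better (lower error) the fine-tuned model is than the
    # frozen reference on the winner and on the loser, respectively.
    diff_w = err(eps_theta_w, noise_w) - err(eps_ref_w, noise_w)
    diff_l = err(eps_theta_l, noise_l) - err(eps_ref_l, noise_l)

    # Reward out-denoising the reference on winners while falling
    # behind it on losers.
    return -F.logsigmoid(-beta * (diff_w - diff_l)).mean()
```

Because the reference model's predictions enter only through the error differences, the frozen copy anchors the fine-tuned model and keeps it from drifting far from its pre-trained behavior while it learns the preference ordering.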
The paper also surveys related work in text-to-audio generation, including the diffusion-based AudioLDM, Make-An-Audio, and Tango, as well as the autoregressive AudioGen, and argues that aligning generated audio with human preferences has been largely absent from this line of work. Audio-alpaca is built using three strategies: generating multiple audio samples from the same prompt, generating audio from perturbed prompts, and generating audio from temporally perturbed prompts. Candidate pairs are then filtered by CLAP score so that each preferred sample is strongly aligned with its text prompt while the undesirable sample remains semantically close but not too similar to the preferred one; a sketch of this filtering step follows below. Training with the DPO loss lets the model learn from both the preferred and the undesirable outputs. In the reported results, Tango 2 outperforms previous models on objective metrics (FAD, KL divergence, IS, and CLAP score) as well as on subjective ratings of overall audio quality and relevance to the input text. The paper further highlights the effectiveness of temporal data augmentation in improving performance on prompts with temporal structure.
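The CLAP-based filtering step described above could look roughly like the following sketch. Here clap_score is a hypothetical helper standing in for a real CLAP checkpoint (e.g., LAION-CLAP), and the threshold values are placeholders rather than the paper's actual settings.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    winner_audio: str  # path to the preferred generation
    loser_audio: str   # path to the undesirable generation

def clap_score(prompt: str, audio_path: str) -> float:
    """Hypothetical helper: cosine similarity between the CLAP text
    embedding of `prompt` and the CLAP audio embedding of the file at
    `audio_path`. A real implementation would load a CLAP checkpoint."""
    raise NotImplementedError

def filter_pairs(candidates, min_winner=0.45, min_gap=0.05):
    """Keep (prompt, winner, loser) triples whose winner is strongly
    aligned with the prompt and whose loser trails it by a clear
    margin. Thresholds are illustrative placeholders."""
    kept = []
    for prompt, winner, loser in candidates:
        s_w = clap_score(prompt, winner)
        s_l = clap_score(prompt, loser)
        if s_w >= min_winner and (s_w - s_l) >= min_gap:
            kept.append(PreferencePair(prompt, winner, loser))
    return kept
```

The gap condition is what keeps the loser "semantically close but not too similar": it must score below the winner by a clear margin, yet pairs built from the same or perturbed prompts naturally stay in the winner's semantic neighborhood.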