The paper "Improving Text-To-Audio Models with Synthetic Captions" addresses the challenge of obtaining high-quality training data, particularly captions, for text-to-audio (TTA) models. The authors propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. This pipeline generates a large dataset of synthetic captions called *AF-AudioSet* for the AudioSet dataset. The benefits of pre-training TTA models on these synthetic captions are evaluated using benchmarks such as AudioCaps and MusicCaps. The results show that pre-training on *AF-AudioSet* significantly improves the quality of audio generation, achieving state-of-the-art performance.

The study also explores different data filtering and combination strategies, model sizes, and architectural designs, finding optimal pre-training recipes across various settings. The contributions include a data labeling pipeline for generating large-scale synthetic captions, the creation of *AF-AudioSet*, and the demonstration of state-of-the-art models through pre-training on this dataset.
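The filtering stage of such a pipeline can be illustrated with a minimal sketch: generate several candidate captions per clip, score each for audio-text agreement (for example with a CLAP-style similarity model), and keep only the highest-scoring candidates above a threshold. The function and threshold below are hypothetical illustrations, not the paper's exact implementation; the toy scorer stands in for a real audio-text similarity model.

```python
from typing import Callable, List, Tuple

def filter_captions(
    candidates: List[str],
    score_fn: Callable[[str], float],  # audio-text similarity; higher is better
    threshold: float = 0.45,           # illustrative cutoff, not from the paper
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Score candidate captions for one clip and keep the top_k above threshold."""
    scored = [(caption, score_fn(caption)) for caption in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

# Toy similarity scores standing in for a real model's output.
toy_scores = {"dogs barking": 0.62, "a violin melody": 0.21, "barking and traffic": 0.51}
best = filter_captions(list(toy_scores), toy_scores.get, threshold=0.45, top_k=1)
```

Varying the threshold trades dataset size against caption accuracy, which is the kind of filtering/combination choice the paper's ablations explore.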