This paper presents a method for generating high-quality synthetic audio captions with an audio language model and using them to pretrain text-to-audio models. The authors build a pipeline that produces a large-scale synthetic caption dataset, AF-AudioSet, and show that pretraining text-to-audio models on it significantly improves audio generation quality. They further explore the trade-off between caption quality and dataset size, and compare the performance of different text encoders and model sizes. The results show that pretraining on AF-AudioSet leads to state-of-the-art performance on both text-to-audio and text-to-music tasks, and that mixing synthetic and real data during pretraining improves generation quality further. Overall, the study demonstrates that synthetic captions generated by an audio language model can effectively enhance the performance of text-to-audio models.
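The caption-quality-versus-data-size trade-off mentioned above amounts to filtering synthetic captions by a quality score and tuning the cutoff: a stricter threshold keeps fewer but better-aligned caption-audio pairs. The sketch below illustrates this with threshold filtering; the scores and threshold values are illustrative assumptions (in practice such scores might come from an audio-text similarity model), not details taken from the paper.

```python
# Sketch: trading dataset size against caption quality by filtering
# synthetic captions on a per-caption quality score. All scores here
# are hypothetical placeholders for audio-text alignment scores.

def filter_captions(captions, scores, threshold):
    """Keep only captions whose quality score meets the threshold."""
    return [c for c, s in zip(captions, scores) if s >= threshold]

captions = ["a dog barking", "rain on a tin roof",
            "static noise", "birdsong at dawn"]
scores = [0.72, 0.85, 0.31, 0.64]  # hypothetical alignment scores

# Raising the threshold shrinks the dataset but raises its average quality.
for threshold in (0.3, 0.6, 0.8):
    subset = filter_captions(captions, scores, threshold)
    print(f"threshold={threshold}: {len(subset)} captions kept")
```

In the paper's setting, each threshold choice would define a different pretraining subset, and the best-performing cutoff is found empirically by pretraining on each subset and comparing generation quality.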