1 Feb 2024 | Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Zunnan Xu, Qinmei Xu, Jingquan Liu, Jiasheng Lu, Xiu Li
BATON is a framework for improving the alignment between generated audio and text prompts using human preference feedback. It proceeds in three stages: (1) building a dataset of text-audio pairs with human-annotated feedback, (2) training a reward model on that dataset to predict human preference, and (3) fine-tuning a text-to-audio model with the reward model so its outputs better match human preferences. The dataset contains 4,800 text-audio pairs spanning 200 audio event categories, of which 2,700 samples carry human annotations. The reward model scores how well an audio clip aligns with its text prompt, and the text-to-audio model is fine-tuned via reward-weighted likelihood maximization; a code sketch of these two learning stages follows below.

Experiments show that BATON substantially improves the generation quality of text-to-audio models in terms of audio integrity, temporal relationships, and agreement with human preference, reaching a MOS-Q of 4.55 for integrity and a MOS-F of 4.41 for temporal relationships, both above the original model's scores. Overall, the framework demonstrates effective alignment between text prompts and generated audio, making it a useful contribution to audio synthesis from textual input.
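To make the two learning stages concrete, the sketch below gives one plausible PyTorch reading of the pipeline: a small reward head trained with binary cross-entropy on the human preference labels (stage 2), and a reward-weighted denoising loss for fine-tuning a latent-diffusion text-to-audio backbone (stage 3). The module names, embedding dimensions, and the simple DDPM-style noising are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch under assumed architectures; BATON's actual encoders,
# reward head, and diffusion backbone are not reproduced here.

class RewardModel(nn.Module):
    """Predicts the probability that a human would judge a text-audio pair as aligned."""
    def __init__(self, text_dim=512, audio_dim=512, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, text_emb, audio_emb):
        # Scalar in (0, 1): estimated human preference for the pair.
        return torch.sigmoid(self.head(torch.cat([text_emb, audio_emb], dim=-1)))


def reward_model_loss(pred, human_label):
    # Stage 2: fit the reward model to the human-annotated alignment labels (0/1).
    return F.binary_cross_entropy(pred, human_label)


def reward_weighted_diffusion_loss(denoiser, latents, text_cond, reward,
                                   alphas_cumprod, num_steps=1000):
    # Stage 3: reward-weighted likelihood maximization. Each sample's standard
    # denoising loss is scaled by the (detached) reward of its text-audio pair,
    # so well-aligned pairs contribute more to the gradient update.
    b = latents.size(0)
    t = torch.randint(0, num_steps, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    a = alphas_cumprod[t].view(b, *([1] * (latents.dim() - 1)))
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise  # DDPM forward process
    pred_noise = denoiser(noisy, t, text_cond)
    per_sample = F.mse_loss(pred_noise, noise, reduction="none").flatten(1).mean(dim=1)
    return (reward.detach().squeeze(-1) * per_sample).mean()
```

In this reading, the reward is computed once per text-audio pair from frozen encoders and simply reweights the usual diffusion objective, so the fine-tuning loop stays identical to standard training apart from the per-sample scaling.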