1 Feb 2024 | Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Zunnan Xu, Qinmei Xu, Jingquan Liu, Jiasheng Lu, Xiu Li
The paper introduces BATON, a framework designed to improve the alignment between generated audio and text prompts using human preference feedback. The framework consists of three key stages: (1) generating a dataset of text-audio pairs with human annotations, (2) training an audio reward model to predict human preferences, and (3) fine-tuning an off-the-shelf text-to-audio (TTA) model with the reward model (a sketch of stages 2 and 3 follows below). The experiments show that BATON substantially improves the alignment of generated audio with human preferences, with gains of +2.3% and +6.0% in CLAP score on the integrity and temporal-relationship tasks, respectively. Human annotators also rated BATON above the original model, with a MOS-Q of 4.55 for integrity and a MOS-F of 4.41 for temporal relationship. The paper closes by discussing limitations and future directions, emphasizing the importance of human feedback and the potential of reinforcement learning for online fine-tuning.
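To make stages (2) and (3) concrete, here is a minimal PyTorch sketch of a preference-trained reward model and a reward-weighted fine-tuning loss. The embedding dimensions, MLP head, sigmoid weighting, and function names are illustrative assumptions, not the paper's exact recipe; the paper only specifies that a reward model is trained on human annotations and then used to fine-tune the TTA model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioRewardModel(nn.Module):
    """Scores how well an audio clip matches a text prompt.

    Hypothetical architecture: a small MLP over concatenated text and
    audio embeddings (e.g., from a pretrained audio-text encoder).
    """

    def __init__(self, text_dim: int = 512, audio_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, text_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # Returns one scalar preference logit per (text, audio) pair.
        return self.head(torch.cat([text_emb, audio_emb], dim=-1)).squeeze(-1)


def reward_model_loss(model, text_emb, audio_emb, human_label):
    """Stage 2: fit the reward model to binary human preference labels
    (1 = audio judged aligned with the prompt, 0 = misaligned)."""
    logits = model(text_emb, audio_emb)
    return F.binary_cross_entropy_with_logits(logits, human_label.float())


def reward_weighted_tta_loss(tta_loss_per_sample, rewards):
    """Stage 3: fine-tune the TTA model by up-weighting samples the
    frozen reward model prefers. `tta_loss_per_sample` is the TTA model's
    own per-example training loss; the weighting scheme is an assumption."""
    weights = torch.sigmoid(rewards).detach()  # keep the reward model frozen
    return (weights * tta_loss_per_sample).mean()
```

In this sketch the reward model is trained first on the human-annotated text-audio pairs, then frozen; during fine-tuning, its scores simply reweight the TTA model's existing training objective so that prompt-aligned generations contribute more to the gradient.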