2024-2-7 | Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour and Andrea Agostinelli
MusicRL is the first music generation system fine-tuned from human feedback. Starting from a pretrained autoregressive MusicLM model, it uses reinforcement learning to maximize sequence-level rewards, addressing a limitation of autoregressive generative models whose training objective does not directly capture what listeners value. Rewards are derived from three sources: automatic metrics, small-scale ratings from selected human raters (targeting text adherence and audio quality), and large-scale feedback from users of the deployed model, from whom 300,000 pairwise preferences were collected.

Three variants are compared. MusicRL-R is fine-tuned on the text adherence and audio quality rewards; MusicRL-U is fine-tuned on a reward model trained on the user preferences; MusicRL-RU combines both reward signals and is the best model according to human raters. In pairwise comparisons, MusicRL-R wins against the MusicLM baseline 83% of the time, MusicRL-U wins 74% of the time, and MusicRL-RU is preferred over every alternative more than 62% of the time, showing that automatic rewards and human feedback are complementary. Ablations indicate that user preference feedback mainly improves audio quality while having minimal impact on text adherence, and that text adherence and quality together account for only part of human preferences, underlining the subjectivity of musical appreciation and the value of integrating human feedback directly into music generation models.
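
To make the fine-tuning objective concrete, below is a minimal, hypothetical sketch of KL-regularized, REINFORCE-style fine-tuning of an autoregressive model on a sequence-level reward. It is not the authors' exact algorithm or API; `policy.generate`, `policy.log_prob`, and `reward_fn` are illustrative placeholders for a generation routine, a per-token log-likelihood, and a scalar reward (e.g. text adherence, audio quality, or a learned preference reward).

```python
import torch

def rl_finetune_step(policy, reference, reward_fn, prompts, optimizer,
                     kl_coef=0.1, max_len=256):
    """One hypothetical KL-regularized REINFORCE update on sequence-level rewards.

    `policy` is the model being fine-tuned, `reference` a frozen copy of the
    pretrained checkpoint. All method names are illustrative placeholders.
    """
    # Sample full audio-token sequences from the current policy and score them.
    with torch.no_grad():
        tokens = policy.generate(prompts, max_len=max_len)   # (B, T) token ids
        seq_reward = reward_fn(prompts, tokens)               # (B,) scalar rewards

    # Per-token log-probabilities under the policy and the frozen reference.
    logp = policy.log_prob(prompts, tokens)                    # (B, T), requires grad
    with torch.no_grad():
        logp_ref = reference.log_prob(prompts, tokens)         # (B, T)

    # Sequence-level KL penalty keeps the policy close to the pretrained model.
    kl = (logp - logp_ref).sum(dim=-1)                         # (B,)
    advantage = seq_reward - seq_reward.mean()                 # simple mean baseline

    # REINFORCE objective: increase log-likelihood of high-reward sequences,
    # penalize drift away from the pretrained distribution.
    loss = -(advantage * logp.sum(dim=-1)).mean() + kl_coef * kl.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), seq_reward.mean().item()
```

The KL term is what distinguishes this from naive reward maximization: without it, the policy can collapse onto degenerate sequences that exploit the reward model rather than producing plausible music.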
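
The user-preference reward used by MusicRL-U is a model trained on pairwise comparisons. A standard way to fit such a model is a Bradley-Terry objective, sketched below under the assumption that `reward_model` maps a prompt and a generated audio-token sequence to a scalar score; the interface is hypothetical, not the paper's code.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, prompts, audio_preferred, audio_rejected):
    """Bradley-Terry loss on pairwise preferences: push the preferred clip's
    score above the rejected clip's score for the same prompt."""
    r_pref = reward_model(prompts, audio_preferred)   # (B,) scores
    r_rej = reward_model(prompts, audio_rejected)     # (B,) scores
    # -log sigmoid(r_pref - r_rej): minimized when preferred clips score higher.
    return -F.logsigmoid(r_pref - r_rej).mean()
```

Trained this way on the collected pairwise preferences, the reward model outputs a scalar that can be plugged in as `reward_fn` in the RL sketch above.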