21 Feb 2024 | Sifei Li, Yuxin Zhang, Fan Tang, Chongyang Ma, Weiming Dong, Changsheng Xu
This paper introduces a method for music style transfer based on time-varying textual inversion. The method transfers the style of a given audio clip onto another piece of music without altering its melody, captures musical attributes from minimal data, and can transfer the styles of specific instruments as well as incorporate natural sounds into composed melodies.

The approach builds on the Riffusion model, which applies latent diffusion to mel-spectrogram images. Its central component is a time-varying textual inversion module that learns pseudo-word text embeddings conditioned on the diffusion timestep, so that the embedding captures mel-spectrogram features at different levels: it shifts from the texture of the style mel-spectrogram toward its structure as the timestep increases.
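The paper is summarized here without code, but the idea of a timestep-conditioned pseudo-word embedding can be illustrated with a short sketch. The PyTorch module below is a hypothetical illustration, not the authors' implementation: it mixes a small bank of learnable embeddings with weights that depend on the normalized diffusion timestep, so the vector substituted for the style token changes between late (noisy) and early (nearly clean) timesteps; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TimeVaryingTokenEmbedding(nn.Module):
    """Hypothetical sketch: a pseudo-word embedding that varies with the
    diffusion timestep by mixing a small bank of learnable embeddings."""

    def __init__(self, embed_dim=768, num_anchors=4, max_timestep=1000):
        super().__init__()
        # Learnable candidate embeddings (e.g. texture-like to structure-like).
        self.anchors = nn.Parameter(torch.randn(num_anchors, embed_dim) * 0.02)
        # Small MLP mapping a normalized timestep to mixing weights.
        self.to_weights = nn.Sequential(
            nn.Linear(1, 64), nn.SiLU(), nn.Linear(64, num_anchors)
        )
        self.max_timestep = max_timestep

    def forward(self, timestep: torch.Tensor) -> torch.Tensor:
        # timestep: (batch,) integer diffusion timesteps.
        t = timestep.float().unsqueeze(-1) / self.max_timestep    # (batch, 1)
        weights = self.to_weights(t).softmax(dim=-1)              # (batch, num_anchors)
        # Convex combination of anchors -> one embedding per timestep.
        return weights @ self.anchors                             # (batch, embed_dim)


# Usage: substitute this vector for the style placeholder token in the prompt
# embedding before each denoising step (illustrative only).
module = TimeVaryingTokenEmbedding()
emb_late = module(torch.tensor([900]))   # embedding used at a large (noisy) timestep
emb_early = module(torch.tensor([50]))   # embedding used near the end of denoising
```

The actual module in the paper may condition on the timestep differently; the sketch only illustrates the timestep dependence of the learned embedding.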
During inference, a bias-reduced stylization technique counteracts the bias of the diffusion model that degrades content preservation, yielding stable results (a generic stylization loop in this spirit is sketched at the end of this summary).

The method is evaluated on a small-scale dataset against three related state-of-the-art approaches and achieves the best performance in content preservation, style fit, and overall quality. It can transfer style from diverse audio sources, including instruments, natural sounds, and synthesized sound effects. However, it has limitations in specific contexts, for example when the content music contains multiple components or when the style audio consists of white-noise-like sounds such as rain or wind. The authors argue that the method can generate highly creative music with a high level of musicality. The work was supported by the National Natural Science Foundation of China.
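The paper's exact bias-reduced stylization procedure is not reproduced here. As a rough illustration of the inference-time loop it modifies, the sketch below follows a generic partial-noising-and-denoising (SDEdit-style, DDIM-like) pass over a content mel-spectrogram latent, querying a timestep-dependent style embedding at every step; the placeholder noise predictor, the noise schedule, the step counts, and the function names are assumptions, not the paper's implementation.

```python
import torch

# Illustrative noise schedule (cumulative alpha_bar per timestep).
MAX_T = 1000
betas = torch.linspace(1e-4, 0.02, MAX_T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Placeholder noise predictor standing in for the Riffusion UNet:
# eps = unet(latent, t, text_embedding). Purely illustrative.
def predict_noise(latent, t, text_emb):
    return torch.zeros_like(latent)

def stylize(content_latent, style_embedding_fn, num_steps=50, strength=0.6):
    """Partial noising + deterministic (DDIM-like) denoising of a content
    mel-spectrogram latent, guided at every step by a timestep-dependent
    style embedding. Generic sketch, not the paper's exact algorithm."""
    # 1) Noise the content latent up to an intermediate timestep so the
    #    melody/structure is kept while the texture can be rewritten.
    t_start = int(strength * MAX_T) - 1
    noise = torch.randn_like(content_latent)
    a_bar = alphas_cumprod[t_start]
    latent = a_bar.sqrt() * content_latent + (1 - a_bar).sqrt() * noise

    # 2) Deterministically denoise back to t = 0.
    timesteps = torch.linspace(t_start, 0, num_steps).long()
    for i, t in enumerate(timesteps):
        text_emb = style_embedding_fn(t.view(1))       # time-varying embedding
        eps = predict_noise(latent, t, text_emb)
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0)
        x0 = (latent - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        latent = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # DDIM update
    return latent

# Usage with the hypothetical embedding module from the earlier sketch:
# stylized_latent = stylize(content_latent, module)
```

A real implementation would use the Riffusion UNet and a proper scheduler in place of the placeholders, plus whatever correction the paper applies to reduce the model's bias before denoising.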