CoMoSVC: Consistency Model-based Singing Voice Conversion

CoMoSVC is a consistency model-based singing voice conversion (SVC) method that aims for high-quality, high-similarity, and high-speed conversion. The method has two stages: the first encodes the extracted features and the singer identity into embeddings, and the second uses those embeddings as conditional inputs to generate mel-spectrograms. A diffusion-based teacher model is first designed and trained to generate high-quality audio; a student model is then distilled from it with consistency distillation so that it can sample in one step. The sampling process of both models is described, with the student achieving significantly faster inference than the teacher. Experiments on two open-source datasets, evaluated with both objective and subjective metrics, show that CoMoSVC achieves comparable or superior conversion performance relative to state-of-the-art diffusion-based SVC methods while significantly improving inference speed.
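
The contrast between multi-step teacher sampling and one-step student sampling can be illustrated with a minimal sketch. Nothing here is taken from the CoMoSVC implementation: `CountingModel` is a hypothetical stand-in for a trained network, the Euler solver and the noise-schedule values are assumptions chosen for illustration, and the "mel-spectrogram" is a short list of floats. The point is only that the teacher calls its network once per solver step, while the student calls its network once in total.

```python
import random


class CountingModel:
    """Hypothetical stand-in for a trained network; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, sigma, cond):
        self.calls += 1
        # Toy assumption: the clean target equals the conditioning vector,
        # so the ideal prediction is `cond` at every noise level.
        return list(cond)


def teacher_sample(denoiser, cond, sigmas, rng):
    """Multi-step sampling: one Euler step (and one network call) per interval."""
    x = [sigmas[0] * rng.gauss(0.0, 1.0) for _ in cond]
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoiser(x, s_cur, cond)
        # Euler update on dx/dsigma = (x - denoised) / sigma
        x = [xi + (s_next - s_cur) * (xi - di) / s_cur
             for xi, di in zip(x, denoised)]
    return x


def student_sample(student, cond, sigma_max, rng):
    """One-step sampling: map pure noise to the output in a single call."""
    x = [sigma_max * rng.gauss(0.0, 1.0) for _ in cond]
    return student(x, sigma_max, cond)


def demo(num_steps=50, seed=0):
    rng = random.Random(seed)
    cond = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # toy conditioning
    sigma_max, sigma_min = 80.0, 0.002                  # assumed schedule ends
    step = (sigma_max - sigma_min) / (num_steps - 1)
    # Linearly decreasing noise levels, followed by a final step to zero noise.
    sigmas = [sigma_max - i * step for i in range(num_steps)] + [0.0]
    teacher, student = CountingModel(), CountingModel()
    mel_teacher = teacher_sample(teacher, cond, sigmas, rng)
    mel_student = student_sample(student, cond, sigma_max, rng)
    return cond, mel_teacher, teacher.calls, mel_student, student.calls
```

Calling `demo()` returns both outputs along with the number of network calls each sampler made, which is where the inference-speed difference comes from: with real networks, each call is a full forward pass.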

3 Jan 2024 | Yiwen Lu¹, Zhen Ye¹, Wei Xue¹†, Xu Tan², Qifeng Liu¹, Yike Guo¹†