STREAMVC: REAL-TIME LOW-LATENCY VOICE CONVERSION

5 Jan 2024 | Yang Yang, Yury Kartynnik, Yunpeng Li, Jiuqiang Tang, Xing Li, George Sung, Matthias Grundmann
StreamVC is a streaming voice conversion solution designed to preserve the content and prosody of the source speech while matching the voice timbre of the target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency, making it suitable for real-time communication scenarios such as calls and video conferencing. The solution leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight, high-quality speech synthesis.

Key contributions include:

1. Using a lightweight causal convolutional network to capture soft speech unit information (see the causal-convolution sketch below).
2. Achieving high-quality speech synthesis with on-device, low-latency streaming inference.
3. Injecting whitened fundamental frequency (f0) information to improve pitch stability without leaking the source speaker's timbre (see the f0-whitening sketch below).

The system is trained with a combination of adversarial, feature, and reconstruction losses, along with a cross-entropy loss for pseudo-label prediction (a sketch of one way to combine these terms follows below). The model achieves an end-to-end latency of 70.8 ms on a Pixel 7 smartphone, demonstrating its efficiency in real-time applications. Evaluation results show that StreamVC performs well in terms of naturalness, intelligibility, speaker similarity, and f0 consistency, matching or outperforming existing state-of-the-art methods.
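The summary describes the content encoder as a lightweight causal convolutional network, which is what enables streaming inference: each output frame may depend only on current and past input. The class below is a minimal illustrative building block showing that causality constraint via left-only padding; it is not the authors' code, and the actual StreamVC encoder and decoder follow the SoundStream architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past and current frames.

    Left-padding by (kernel_size - 1) * dilation ensures output frame t
    never depends on inputs later than t, which is what makes
    frame-by-frame streaming inference possible. Illustrative only.
    """

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad only on the left (past) side.
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)
```

In a streaming deployment, the left context would typically be carried over as cached state between audio chunks rather than re-padded with zeros, so that chunked inference matches offline inference.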
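The whitened f0 injection is described only at a high level here. A common way to realize it, and an assumption in this sketch rather than the paper's stated recipe, is to normalize log-f0 by utterance-level statistics so that only the relative pitch contour (prosody) survives while the speaker-dependent pitch level and range are removed. The function name and interface below are hypothetical.

```python
import numpy as np

def whiten_f0(f0_hz: np.ndarray, voiced: np.ndarray) -> np.ndarray:
    """Normalize per-frame f0 by utterance-level statistics (sketch only).

    Removes the absolute pitch level/range (a timbre cue of the source
    speaker) while keeping the relative contour that carries prosody.
    """
    if not voiced.any():
        return np.zeros_like(f0_hz, dtype=np.float64)
    # Work in the log domain, where pitch intervals are additive.
    log_f0 = np.where(voiced, np.log(np.maximum(f0_hz, 1e-8)), 0.0)
    # Utterance-level mean and standard deviation over voiced frames only.
    mu = log_f0[voiced].mean()
    sigma = log_f0[voiced].std() + 1e-8
    return np.where(voiced, (log_f0 - mu) / sigma, 0.0)
```

In a real-time setting the utterance-level statistics are not available ahead of time, so they would have to be estimated causally (for example with running means and variances); the offline form is shown here only for clarity.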
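The summary lists four training loss terms but not how they are combined. A weighted sum is the usual setup for GAN-style codec training, and is assumed below; the weights are illustrative placeholders, not values taken from the paper.

```python
def total_loss(adv: float, feat: float, rec: float, ce: float,
               w_adv: float = 1.0, w_feat: float = 1.0,
               w_rec: float = 1.0, w_ce: float = 1.0) -> float:
    """Weighted sum of adversarial, feature, reconstruction, and
    pseudo-label cross-entropy losses. Weights are hypothetical; the
    actual coefficients are not given in this summary."""
    return w_adv * adv + w_feat * feat + w_rec * rec + w_ce * ce
```

The cross-entropy term supervises the content encoder's prediction of soft speech unit pseudo-labels, while the adversarial, feature, and reconstruction terms drive the waveform synthesis quality, mirroring the SoundStream training strategy mentioned above.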