STREAMVC: REAL-TIME LOW-LATENCY VOICE CONVERSION

5 Jan 2024 | Yang Yang, Yury Kartynnik, Yunpeng Li, Jiuqiang Tang, Xing Li, George Sung, Matthias Grundmann
StreamVC is a streaming voice conversion solution designed to preserve the content and prosody of the source speech while matching the voice timbre of the target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency, making it suitable for real-time communication scenarios such as calls and video conferencing. The solution leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight, high-quality speech synthesis.

Key contributions include:

1. Using a lightweight causal convolutional network to capture soft speech unit information (see the causal-convolution sketch below).
2. Achieving high-quality speech synthesis with on-device, low-latency streaming inference.
3. Injecting whitened fundamental frequency (f0) information to improve pitch stability without leaking the source speaker's timbre (see the f0-whitening sketch below).

The system is trained with a combination of adversarial, feature, and reconstruction losses, along with a cross-entropy loss for pseudo-label prediction (a sketch of one way to combine these terms follows below). The model achieves an end-to-end latency of 70.8 ms on a Pixel 7 smartphone, demonstrating its efficiency in real-time applications. Evaluation results show that StreamVC performs well in terms of naturalness, intelligibility, speaker similarity, and f0 consistency, matching or outperforming existing state-of-the-art methods.
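The summary describes the content encoder as a lightweight causal convolutional network, which is what enables streaming inference: each output frame may depend only on current and past input. The class below is a minimal illustrative building block showing that causality constraint via left-only padding; it is not the authors' code, and the actual StreamVC encoder and decoder follow the SoundStream architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past and current frames.

    Left-padding by (kernel_size - 1) * dilation ensures output frame t
    never depends on inputs later than t, which is what makes
    frame-by-frame streaming inference possible. Illustrative only.
    """

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad only on the left (past) side.
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)
```

In a streaming deployment, the left context would typically be carried over as cached state between audio chunks rather than re-padded with zeros, so that chunked inference matches offline inference.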
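The whitened f0 injection is described only at a high level here. A common way to realize it, and an assumption in this sketch rather than the paper's stated recipe, is to normalize log-f0 by utterance-level statistics so that only the relative pitch contour (prosody) survives while the speaker-dependent pitch level and range are removed. The function name and interface below are hypothetical.

```python
import numpy as np

def whiten_f0(f0_hz: np.ndarray, voiced: np.ndarray) -> np.ndarray:
    """Normalize per-frame f0 by utterance-level statistics (sketch only).

    Removes the absolute pitch level/range (a timbre cue of the source
    speaker) while keeping the relative contour that carries prosody.
    """
    if not voiced.any():
        return np.zeros_like(f0_hz, dtype=np.float64)
    # Work in the log domain, where pitch intervals are additive.
    log_f0 = np.where(voiced, np.log(np.maximum(f0_hz, 1e-8)), 0.0)
    # Utterance-level mean and standard deviation over voiced frames only.
    mu = log_f0[voiced].mean()
    sigma = log_f0[voiced].std() + 1e-8
    return np.where(voiced, (log_f0 - mu) / sigma, 0.0)
```

In a real-time setting the utterance-level statistics are not available ahead of time, so they would have to be estimated causally (for example with running means and variances); the offline form is shown here only for clarity.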
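The summary lists four training loss terms but not how they are combined. A weighted sum is the usual setup for GAN-style codec training, and is assumed below; the weights are illustrative placeholders, not values taken from the paper.

```python
def total_loss(adv: float, feat: float, rec: float, ce: float,
               w_adv: float = 1.0, w_feat: float = 1.0,
               w_rec: float = 1.0, w_ce: float = 1.0) -> float:
    """Weighted sum of adversarial, feature, reconstruction, and
    pseudo-label cross-entropy losses. Weights are hypothetical; the
    actual coefficients are not given in this summary."""
    return w_adv * adv + w_feat * feat + w_rec * rec + w_ce * ce
```

The cross-entropy term supervises the content encoder's prediction of soft speech unit pseudo-labels, while the adversarial, feature, and reconstruction terms drive the waveform synthesis quality, mirroring the SoundStream training strategy mentioned above.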