VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

12 Jun 2024 | Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yangqing Liu, Sheng Zhao, Jinyu Li, Furu Wei
VALL-E R is a zero-shot text-to-speech (TTS) system designed to address two limitations of earlier TTS methods: robustness and inference efficiency. It introduces a phoneme monotonic alignment strategy that strengthens the connection between phonemes and the acoustic sequence, which gives more precise alignment and improves robustness. It also uses a codec-merging approach to downsample the discrete codes in the shallow quantization layer, which reduces the number of autoregressive steps and speeds up inference without compromising speech quality. Experimental results show that VALL-E R achieves strong controllability over phonemes, approaches the Word Error Rate (WER) of ground truth, and reduces inference time by over 60%. Potential applications include speech synthesis for individuals with aphasia, education, entertainment, and assistive technologies.
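To make the codec-merging idea concrete, here is a minimal Python sketch of why downsampling the frame sequence reduces the number of autoregressive steps. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say how frames are merged, so averaging adjacent frames is assumed here, and the names `merge_frames`, `autoregressive_steps`, and `rate` are hypothetical.

```python
from typing import List


def merge_frames(frames: List[List[float]], rate: int = 2) -> List[List[float]]:
    """Average every `rate` consecutive frame vectors.

    Assumption: averaging is used as the merge rule purely for illustration;
    the source only states that the codes are downsampled.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    merged = []
    for start in range(0, len(frames), rate):
        group = frames[start:start + rate]  # the last group may be shorter
        dim = len(group[0])
        merged.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return merged


def autoregressive_steps(num_frames: int, rate: int) -> int:
    """Decoding steps needed when each step emits one merged frame."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    return -(-num_frames // rate)  # ceiling division


# Toy example: three 2-dimensional frames merged in pairs.
toy_frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0]]
merged_frames = merge_frames(toy_frames, rate=2)
steps_before = autoregressive_steps(len(toy_frames), rate=1)
steps_after = autoregressive_steps(len(toy_frames), rate=2)
```

With a merge rate of 2, the decoder in this sketch runs about half as many steps. The sketch does not attempt to reproduce the reported reduction of over 60% in inference time, which is a measured result from the paper.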