[slides] VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

VALL-E R is a zero-shot text-to-speech (TTS) system that improves on the original VALL-E model in robustness and efficiency. It introduces a phoneme monotonic alignment strategy that strengthens the connection between phonemes and the acoustic sequence: acoustic tokens are constrained to match their associated phonemes, which yields more precise alignment. It also employs a codec-merging approach that downsamples the discrete codes in the shallow quantization layer, that is, it lowers the sampling rate of the first codec layer, which accelerates decoding while preserving speech quality. Together, these strategies give VALL-E R strong robustness, with performance approaching that of ground truth, including on complex or long sequences, and reduce inference time by over 60%.

As a zero-shot system, VALL-E R can generate speech for unseen speakers without fine-tuning. It can also control prosody by using phoneme-based input, enabling timbre cloning and voice conversion. The system is evaluated with both objective and subjective metrics. VALL-E R is a research project with no immediate commercial applications, but it has potential uses in fields such as education, entertainment, and assistive technologies.
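To make the monotonic alignment constraint concrete, here is a minimal sketch of what it means for acoustic frames to match their associated phonemes in order. It is an illustration only, not the authors' implementation: the function names `expand_alignment` and `is_monotonic`, the per-phoneme frame counts, and the example phonemes are all hypothetical.

```python
from typing import List


def expand_alignment(phonemes: List[str], durations: List[int]) -> List[int]:
    """Return, for each acoustic frame, the index of the phoneme it belongs to.

    `durations` (hypothetical input) holds the number of frames per phoneme.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need one duration per phoneme")
    if any(d < 1 for d in durations):
        raise ValueError("every phoneme must cover at least one frame")
    frame_to_phoneme: List[int] = []
    for index, frames in enumerate(durations):
        frame_to_phoneme.extend([index] * frames)
    return frame_to_phoneme


def is_monotonic(frame_to_phoneme: List[int], num_phonemes: int) -> bool:
    """Check that the alignment starts at the first phoneme, ends at the last,
    and at every frame either stays on the current phoneme or advances by one."""
    if not frame_to_phoneme:
        return num_phonemes == 0
    if frame_to_phoneme[0] != 0 or frame_to_phoneme[-1] != num_phonemes - 1:
        return False
    steps = zip(frame_to_phoneme, frame_to_phoneme[1:])
    return all(nxt - cur in (0, 1) for cur, nxt in steps)


# Illustrative phonemes and frame counts, assumed for this example only.
phonemes = ["HH", "AH", "L", "OW"]
alignment = expand_alignment(phonemes, [2, 3, 1, 2])
print(alignment, is_monotonic(alignment, len(phonemes)))
```

Under this constraint no phoneme is skipped, repeated out of order, or revisited. Those are the kinds of alignment errors that a model without an explicit phoneme-to-frame constraint could make.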

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

2024-06-12 | Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei