VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

30 Jan 2024 | Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu
VALL-T is a decoder-only generative Transducer model designed to enhance the robustness and controllability of Text-to-Speech (TTS) systems. It introduces shifting relative position embeddings for the input phoneme sequence, explicitly indicating the monotonic generation process while maintaining the decoder-only Transformer architecture. This approach addresses the issue of hallucination, such as mispronunciation, word skipping, and repeating, which can occur in decoder-only TTS models due to the lack of monotonic alignment constraints. VALL-T retains the capability of prompt-based zero-shot adaptation and demonstrates better robustness against hallucinations, with a 28.3% reduction in word error rate.
Additionally, the controllability of alignment during decoding allows for the use of untranscribed speech prompts, even in unknown languages, and enables the synthesis of lengthy speech by utilizing an aligned context window. The model's performance is evaluated through various experiments, including zero-shot TTS, leveraging untranscribed speech prompts, and generating lengthy speech, showing superior results compared to existing models.
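The core mechanism here, shifting relative position embeddings, can be illustrated with a minimal sketch. All function names below are illustrative assumptions, not the authors' code: each phoneme is assigned a position relative to the current alignment pointer, and emitting a blank token advances that pointer by one, shifting every relative position.

```python
# Minimal sketch (assumed, not the authors' implementation) of shifting
# relative position embeddings in a Transducer-style decoder-only model.
# Phoneme i receives relative position (i - p), where p is the current
# alignment pointer. Emitting a blank token advances p to the next phoneme,
# so all relative positions shift left by one; emitting a speech token
# leaves p unchanged, enforcing monotonic alignment.

def relative_positions(num_phonemes: int, alignment_pos: int) -> list[int]:
    """Relative position ids for the phoneme sequence at alignment_pos."""
    return [i - alignment_pos for i in range(num_phonemes)]

def decode_step(token_is_blank: bool, alignment_pos: int) -> int:
    """Advance the alignment pointer only when a blank is emitted."""
    return alignment_pos + 1 if token_is_blank else alignment_pos

# Simulate a few decoding steps over a 5-phoneme input.
p = 0
print(relative_positions(5, p))  # [0, 1, 2, 3, 4]
p = decode_step(True, p)         # blank emitted: move to next phoneme
print(relative_positions(5, p))  # [-1, 0, 1, 2, 3]
p = decode_step(False, p)        # speech token emitted: no shift
print(relative_positions(5, p))  # [-1, 0, 1, 2, 3]
```

Because the model only ever sees positions relative to the current phoneme, the alignment pointer can also be set externally at decoding time, which is what enables the paper's untranscribed-prompt and aligned-context-window tricks.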