VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech


30 Jan 2024 | Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu
VALL-T is a decoder-only generative Transducer model designed to improve the robustness and decoding controllability of text-to-speech (TTS) systems. It introduces shifting relative position embeddings for the input phoneme sequence, which explicitly guide the monotonic generation process while preserving the decoder-only Transformer architecture. As a result, VALL-T retains prompt-based zero-shot adaptation and reduces the word error rate (WER) by a relative 28.3% compared to previous models.

The model targets a key limitation of decoder-only TTS systems: the absence of a monotonic alignment constraint between text and speech, which causes hallucinations such as mispronunciation, word skipping, and word repetition. VALL-T integrates the modularized Transducer framework into a decoder-only Transformer, using the shifting relative position embeddings to enforce monotonic alignment. Because the alignment is modeled explicitly, phoneme durations are captured implicitly, with no separate duration predictor required, which simplifies training and improves robustness against hallucination.

In zero-shot TTS experiments, VALL-T achieves a significantly lower WER than existing models while demonstrating better naturalness and speaker similarity. Its alignment controllability enables two further capabilities: conditioning on untranscribed speech prompts, even in unknown languages, and synthesizing long utterances with an aligned context window that keeps both the input and output sequences within a limited span, supporting streaming generation of lengthy speech. The paper details the architecture and training procedure, covering the shifting relative position embeddings, the Transducer loss, and the use of untranscribed speech prompts, and validates the design through extensive experiments. Overall, VALL-T offers improved robustness, controllability, and performance in zero-shot and long-form speech synthesis.
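To make the alignment mechanism concrete, here is a minimal sketch of how shifting relative position embeddings over the phoneme sequence could be implemented. The class and argument names are hypothetical illustrations, not the paper's code; the sketch assumes that offset 0 marks the phoneme currently being generated and slides rightward each time the alignment advances.

```python
import torch
import torch.nn as nn

class ShiftingRelativePositionEmbedding(nn.Module):
    """Minimal sketch (hypothetical names): adds a learned embedding for
    each phoneme's offset relative to the phoneme currently being
    generated, so offset 0 slides along the phoneme sequence as the
    monotonic alignment advances."""

    def __init__(self, max_distance: int, dim: int):
        super().__init__()
        self.max_distance = max_distance
        # One embedding per relative offset in [-max_distance, max_distance].
        self.emb = nn.Embedding(2 * max_distance + 1, dim)

    def forward(self, phoneme_states: torch.Tensor, current_idx: int) -> torch.Tensor:
        # phoneme_states: (num_phonemes, dim) hidden states of the phoneme tokens.
        num_phonemes = phoneme_states.size(0)
        offsets = torch.arange(num_phonemes, device=phoneme_states.device) - current_idx
        # Clip to the embedding table and shift into the [0, 2*max_distance] range.
        offsets = offsets.clamp(-self.max_distance, self.max_distance) + self.max_distance
        return phoneme_states + self.emb(offsets)

# Advancing the alignment by one phoneme shifts every offset left by one:
# current_idx=0 -> offsets [0, 1, 2, ...]; current_idx=2 -> [-2, -1, 0, ...].
```

Incrementing current_idx shifts every phoneme's relative index left by one, which is how the decoder-only Transformer is told that generation has monotonically advanced to the next phoneme.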
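The Transducer loss marginalizes over all monotonic alignments between the phoneme sequence and the output speech tokens via a forward (dynamic-programming) recursion. Below is a minimal, unoptimized sketch of that recursion for a single utterance, written under the convention that a special blank token advances the alignment to the next phoneme; the tensor layout and names are assumptions for illustration, not the paper's implementation.

```python
import torch

def transducer_loss(log_probs: torch.Tensor, targets: torch.Tensor, blank: int) -> torch.Tensor:
    """Forward-algorithm Transducer loss for one utterance (illustrative sketch).

    log_probs: (N, T + 1, V) log-distribution over the vocabulary when the
        alignment is at phoneme n after emitting t speech tokens.
    targets:   (T,) ground-truth speech token sequence.
    blank:     index of the special token that advances the alignment.
    Returns the negative log-likelihood, summed over all monotonic
    alignments by dynamic programming.
    """
    N, T_plus_1, V = log_probs.shape
    T = T_plus_1 - 1
    neg_inf = torch.tensor(float("-inf"))
    # alpha[n, t]: log-prob of emitting the first t tokens, aligned at phoneme n.
    alpha = torch.full((N, T + 1), float("-inf"))
    alpha[0, 0] = 0.0
    for n in range(N):
        for t in range(T + 1):
            if n == 0 and t == 0:
                continue
            # Emit the next speech token, staying on the same phoneme.
            emit = (alpha[n, t - 1] + log_probs[n, t - 1, targets[t - 1]]
                    if t > 0 else neg_inf)
            # Emit blank, advancing from the previous phoneme.
            shift = (alpha[n - 1, t] + log_probs[n - 1, t, blank]
                     if n > 0 else neg_inf)
            alpha[n, t] = torch.logaddexp(emit, shift)
    # Terminate with a final blank at the last phoneme after all tokens.
    return -(alpha[N - 1, T] + log_probs[N - 1, T, blank])
```

A practical implementation would batch and vectorize this recursion across the lattice; the Python loops here are purely for clarity.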