RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

19 May 2024 | Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao
RALL-E is a robust codec language modeling method for text-to-speech (TTS) synthesis that addresses the poor robustness of LLM-based TTS systems. The core idea of RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. Specifically, RALL-E predicts prosody features (pitch and duration) before predicting speech tokens, using them as intermediate conditions to stabilize the generation process. Additionally, RALL-E applies duration-guided masking to force the model to attend to the relevant phonemes and prosody features when predicting speech tokens.
Comprehensive objective and subjective evaluations demonstrate that RALL-E significantly improves the word error rate (WER) of zero-shot TTS compared to the baseline VALL-E, from 5.6% to 2.5% without reranking and from 1.7% to 1.0% with reranking. Furthermore, RALL-E correctly synthesizes hard sentences that are challenging for VALL-E, reducing the error rate from 68% to 4%. The contributions of RALL-E include improving the robustness of LLM-based TTS by incorporating prosody tokens and duration-guided masking, conducting extensive evaluations, and demonstrating superior performance on hard sentences.
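The duration-guided masking described above can be illustrated with a minimal sketch: given per-phoneme durations, each speech-token step is restricted to attend only to its corresponding phoneme plus a small neighborhood. This is an assumption-laden illustration, not the paper's implementation; the function name `duration_guided_mask` and the symmetric `window` parameter are hypothetical.

```python
import numpy as np

def duration_guided_mask(durations, window=1):
    """Build a boolean mask of shape (num_speech_steps, num_phonemes).

    Speech-token step t may attend only to the phoneme it falls inside
    (per the cumulative durations) plus `window` phonemes on each side.
    """
    durations = np.asarray(durations, dtype=int)
    total = int(durations.sum())  # total number of speech-token steps
    # phoneme index for every speech-token step
    phone_ids = np.repeat(np.arange(len(durations)), durations)
    mask = np.zeros((total, len(durations)), dtype=bool)
    for t, p in enumerate(phone_ids):
        lo = max(0, p - window)
        hi = min(len(durations), p + window + 1)
        mask[t, lo:hi] = True  # allow attention to the local phoneme window
    return mask

# Example: three phonemes lasting 2, 1, and 3 speech-token steps
m = duration_guided_mask([2, 1, 3], window=1)
print(m.shape)  # (6, 3)
```

In the actual model this mask would modulate cross-attention between speech tokens and the phoneme/prosody sequence, so a mispredicted token cannot attend to distant, irrelevant phonemes.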