RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

19 May 2024 | Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao
RALL-E is a robust language modeling method for text-to-speech (TTS) synthesis that improves the robustness of large language model (LLM)-based TTS. The core idea of RALL-E is chain-of-thought (CoT) prompting, which decomposes the TTS task into simpler steps to enhance robustness. RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E uses the predicted duration to guide the computation of self-attention weights in the Transformer, so that the model focuses on the corresponding phonemes and prosody features when predicting speech tokens. Comprehensive objective and subjective evaluations demonstrate that RALL-E significantly improves the word error rate (WER) of zero-shot TTS compared to VALL-E, a powerful baseline method: RALL-E reduces the WER from 5.6% (without reranking) and 1.7% (with reranking) to 2.5% and 1.0%, respectively. Furthermore, RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate from 68% to 4%. RALL-E also shows superior performance in terms of speech naturalness and speaker similarity. The contributions of this work include presenting RALL-E, a robust codec language modeling method with chain-of-thought prompting for TTS, conducting comprehensive objective and subjective evaluations, and evaluating RALL-E on particularly hard sentences. RALL-E demonstrates superior robustness in all evaluations.
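To make the duration-guided attention idea concrete, below is a minimal sketch (not the authors' implementation) of how predicted per-phoneme durations could be turned into an attention mask that restricts each speech-token step to a window of phonemes around the one it is aligned to. The window size, tensor shapes, and mask construction are assumptions for illustration only.

```python
# Hypothetical sketch of duration-guided attention masking (assumptions,
# not RALL-E's actual code): each speech frame may only attend to phonemes
# within +-window of the phoneme it is aligned to, where the alignment is
# derived from the predicted per-phoneme durations.

import torch


def duration_guided_mask(durations: torch.Tensor, window: int = 1) -> torch.Tensor:
    """Build a boolean attention mask of shape (T_speech, N_phonemes).

    durations: 1-D integer tensor; durations[i] is the number of speech
               frames aligned to phoneme i (assumed to sum to T_speech).
    window:    how many neighbouring phonemes each frame may also attend to.
    Returns True where attention is allowed.
    """
    n_phonemes = durations.numel()
    # Map each speech frame to the index of the phoneme it belongs to.
    frame_to_phoneme = torch.repeat_interleave(
        torch.arange(n_phonemes), durations
    )                                           # (T_speech,)
    phoneme_ids = torch.arange(n_phonemes)      # (N_phonemes,)
    # Allow attention only within +-window phonemes of the aligned one.
    distance = (frame_to_phoneme[:, None] - phoneme_ids[None, :]).abs()
    return distance <= window


# Example: 3 phonemes lasting 2, 1, and 3 speech frames respectively.
mask = duration_guided_mask(torch.tensor([2, 1, 3]), window=1)
# The mask would then be applied to attention scores before the softmax,
# e.g. scores.masked_fill(~mask, float("-inf")).
print(mask)
```

In this sketch the mask is applied to the phoneme (and prosody) portion of the self-attention scores, which is one plausible way to realize "enforcing the model to focus on the corresponding phonemes" that the abstract describes.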