2024 | Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho
CLaM-TTS is a novel approach that improves neural codec language modeling for zero-shot text-to-speech (TTS) synthesis. The method employs a probabilistic residual vector quantization (RVQ) scheme that achieves superior compression in token length and enables the language model to generate multiple tokens at once, eliminating the need for cascaded modeling over the token stack. This makes training and inference of large language models practical within the TTS domain. Evaluated against state-of-the-art neural codec-based TTS models, CLaM-TTS demonstrates better or comparable performance in naturalness, intelligibility, speaker similarity, and inference speed. The study also examines how the extent of language-model pretraining and the choice of text tokenization strategy affect TTS performance. The model is trained on roughly 100,000 hours of paired speech and text data spanning 11 languages. The results show that CLaM-TTS outperforms existing models on several key metrics, including subjective speech quality and objective measures such as word error rate (WER) and character error rate (CER), and it generates speech faster than competing models. Overall, the study highlights the effectiveness of the proposed method in achieving high-quality zero-shot TTS synthesis from only a short prompt.
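To make the compression idea concrete, the sketch below shows plain residual vector quantization: each stage quantizes the residual left over by the previous stage, so a single latent vector is represented by a short stack of codebook indices. This is a minimal, illustrative example only; CLaM-TTS replaces the hard nearest-neighbor assignment with a probabilistic, variationally trained formulation, and the codebook sizes, dimensions, and function names here are assumptions for illustration.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Encode x with residual VQ: at each stage, pick the nearest code
    vector for the current residual, then subtract it and pass the new
    residual to the next stage."""
    residual = x.copy()
    indices = []
    for codebook in codebooks:                        # codebook shape: (K, D)
        dists = np.linalg.norm(residual[None, :] - codebook, axis=1)
        k = int(np.argmin(dists))                     # nearest code vector
        indices.append(k)
        residual = residual - codebook[k]             # residual for next stage
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code vectors across all stages."""
    return sum(cb[k] for k, cb in zip(indices, codebooks))

# Usage (all shapes assumed): 4 stages, 256-entry codebooks, 64-dim latent.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 64)) for _ in range(4)]
x = rng.standard_normal(64)
idx = rvq_encode(x, codebooks)
x_hat = rvq_decode(idx, codebooks)
print(idx, np.linalg.norm(x - x_hat))
```

In this toy setup each frame is summarized by only four indices, which is the kind of short token stack that lets a language model emit all of a frame's codes in one step instead of modeling each quantizer level with a separate cascaded stage.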