CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

3 Apr 2024 | Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho
CLaM-TTS is a system that leverages neural audio codecs and language modeling to improve zero-shot Text-to-Speech (TTS) synthesis. It employs probabilistic residual vector quantization to achieve strong compression in token length and to let the language model generate multiple tokens at once, eliminating the need for cascaded modeling. At its core, a Mel-VAE encodes mel-spectrograms into discrete latent representations, and a latent language model predicts the latent variables, which are then converted into discrete audio tokens. Experimental results show that CLaM-TTS matches or outperforms state-of-the-art neural codec-based TTS models in naturalness, intelligibility, speaker similarity, and inference speed. The paper also investigates how the extent of language-model pretraining and the choice of text tokenization strategy affect TTS performance.
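The paper's actual implementation is not reproduced here, but the residual vector quantization idea it builds on can be illustrated with a small sketch: each stage quantizes the residual left by the previous stage, so a stack of D codebooks of size K describes a vector far more precisely than a single codebook. This is the plain deterministic variant, not the probabilistic scheme CLaM-TTS proposes, and all names and sizes below are illustrative.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization.

    Each codebook stage picks the code vector nearest to the
    current residual, then subtracts it; the next stage refines
    what is left over.
    """
    residual = x.astype(np.float64)
    indices = []
    for cb in codebooks:  # cb: (K, dim) array of code vectors
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))
        indices.append(k)
        residual = residual - cb[k]
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruction is simply the sum of the chosen code vectors.
    return sum(cb[k] for cb, k in zip(codebooks, indices))

# Toy example: depth-4 RVQ over 8-dimensional vectors, 16 codes per stage.
rng = np.random.default_rng(0)
dim, K, depth = 8, 16, 4
codebooks = [rng.normal(scale=1.0 / (s + 1), size=(K, dim)) for s in range(depth)]
x = rng.normal(size=dim)
idx = rvq_encode(x, codebooks)
x_hat = rvq_decode(idx, codebooks)
```

Generating all `depth` indices for a frame in one model step, rather than one index per step, is what the abstract refers to as avoiding cascaded modeling.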
[slides and audio] CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech