3 Apr 2024 | Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho
CLaM-TTS is a zero-shot Text-to-Speech (TTS) system that leverages a neural audio codec together with a large language model. It employs probabilistic residual vector quantization to achieve strong compression in token length and to let the language model generate multiple tokens simultaneously, eliminating the need for cascaded modeling. At its core, a Mel-VAE encodes mel-spectrograms into discrete latent representations, and a latent language model predicts latent variables that are then converted into discrete audio tokens. Experiments show that CLaM-TTS matches or outperforms state-of-the-art neural-codec-based TTS models in naturalness, intelligibility, speaker similarity, and inference speed. The paper also investigates how the extent of language-model pretraining and the choice of text tokenization strategy affect TTS performance.
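To build intuition for the tokenization step, the sketch below shows plain (non-probabilistic) residual vector quantization: each stage quantizes the residual left over by the previous stage, so a single frame is represented by a short stack of codebook indices. This is a minimal illustration, not CLaM-TTS's learned probabilistic variant; the codebook shapes and the `rvq_encode` helper are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 codebooks of 8 codes each over 16-dim latents.
num_quantizers, codebook_size, dim = 4, 8, 16
codebooks = rng.normal(size=(num_quantizers, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Quantize x stage by stage, each stage encoding the residual."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # Pick the code nearest to the current residual.
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        # Subtract the chosen code; the leftover goes to the next stage.
        residual = residual - cb[idx]
    # Token stack for this frame, plus the quantized reconstruction.
    return indices, x - residual

x = rng.normal(size=dim)
tokens, x_hat = rvq_encode(x, codebooks)
```

Because all stages describe one frame, a model that predicts the whole index stack at once (as CLaM-TTS's language model does for its latent tokens) avoids the cascaded, stage-by-stage generation that earlier codec-based TTS systems required.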