This paper introduces MELLE, a novel language modeling approach for text-to-speech (TTS) synthesis based on continuous-valued tokens. Unlike traditional methods that rely on vector quantization, MELLE directly generates continuous mel-spectrogram frames conditioned on text, bypassing discrete code generation entirely. This avoids the fidelity loss associated with vector-quantized tokens, which were originally designed for audio compression rather than synthesis. MELLE uses a single-stage decoder-only model and incorporates variational inference to enhance sampling diversity and model robustness.
The key innovations of MELLE are the use of a regression loss, combined with a spectrogram flux loss, to model continuous-valued tokens, and a latent sampling module derived from variational inference to improve output diversity. MELLE is trained on the large-scale Libriheavy corpus and evaluated on LibriSpeech, demonstrating superior performance across multiple metrics compared to existing models such as VALL-E and its variants. It achieves a 47.9% relative reduction in WER compared to VALL-E and an 8.1% reduction compared to VALL-E 2. Subjective evaluations also favor MELLE: it achieves MOS and CMOS scores comparable to the ground truth, and an even higher SMOS.
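To make the latent sampling idea concrete, the sketch below shows one common way such a module can be realized: the decoder's hidden state is projected to a Gaussian mean and log-variance, a latent vector is drawn via the reparameterization trick, and a KL term regularizes the distribution toward a standard normal prior. This is an illustrative sketch, not the authors' code; the module name, dimensions, and KL reduction are assumptions.

```python
import torch
import torch.nn as nn

class LatentSampling(nn.Module):
    """Hypothetical sketch of a variational latent sampling module.

    Projects a decoder hidden state to a Gaussian (mean, log-variance),
    samples a latent via the reparameterization trick, and returns the
    KL divergence to a standard normal prior. Names and shapes are
    illustrative, not taken from the MELLE release.
    """
    def __init__(self, hidden_dim: int, latent_dim: int):
        super().__init__()
        self.to_mean = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, hidden: torch.Tensor):
        mean = self.to_mean(hidden)
        logvar = self.to_logvar(hidden)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        eps = torch.randn_like(mean)
        z = mean + torch.exp(0.5 * logvar) * eps
        # KL(N(mu, sigma^2) || N(0, I)), averaged over batch and time
        kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
        return z, kl
```

Because sampling is stochastic at every decoding step, repeated synthesis of the same text yields naturally varied prosody, which is the diversity benefit the paper attributes to this module.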
MELLE's architecture consists of pre-nets that embed the text and mel-spectrogram inputs, an autoregressive Transformer decoder, a latent sampling module that produces latent embeddings, a stop prediction layer, and a convolutional post-net for spectrogram refinement. The model is trained with four loss terms: a regression loss, a KL divergence loss, a spectrogram flux loss, and a binary cross-entropy loss for stop prediction. Training is efficient and straightforward, avoiding the two-pass hierarchical structure of VALL-E.
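A minimal sketch of how these four objectives might be combined is shown below. The weighting scheme and the exact form of the spectrogram flux term are assumptions (here it matches predicted frame-to-frame variation against the target's), not the paper's precise formulation.

```python
import torch
import torch.nn.functional as F

def melle_style_loss(pred_mel, post_mel, target_mel,
                     stop_logits, stop_targets, kl,
                     lambda_kl=1.0, lambda_flux=1.0):
    """Illustrative combination of the four training losses.

    pred_mel / post_mel: decoder output before / after the post-net,
    shaped (batch, time, n_mels). The flux term below is a plausible
    reading of the spectrogram flux loss, not the paper's exact form.
    """
    # Regression loss on mel frames, before and after the post-net
    reg = F.l1_loss(pred_mel, target_mel) + F.l1_loss(post_mel, target_mel)
    # Spectrogram flux (assumed form): match predicted frame-to-frame
    # variation to that of the target, discouraging static, repetitive output
    pred_flux = pred_mel[:, 1:] - pred_mel[:, :-1]
    tgt_flux = target_mel[:, 1:] - target_mel[:, :-1]
    flux = F.l1_loss(pred_flux, tgt_flux)
    # Binary cross-entropy for stop prediction
    stop = F.binary_cross_entropy_with_logits(stop_logits, stop_targets)
    return reg + lambda_kl * kl + lambda_flux * flux + stop
```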
Inference proceeds in a single autoregressive stage, with the reduction factor allowing the model to emit multiple mel-spectrogram frames per decoding step, improving efficiency. MELLE's in-context learning capability enables it to generate high-fidelity, natural-sounding speech for unseen speakers without fine-tuning. The model is evaluated on both objective and subjective metrics, showing significant improvements in robustness, speaker similarity, and naturalness over existing methods. The results demonstrate that MELLE offers a more streamlined and efficient paradigm for autoregressive speech synthesis without vector quantization.
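The following sketch illustrates what such single-stage decoding with a reduction factor could look like. The `model.decode_step` and `model.postnet_refine` methods, tensor shapes, and stopping threshold are hypothetical placeholders for illustration only.

```python
import torch

@torch.no_grad()
def synthesize(model, text_ids, prompt_mel, reduction_factor=2,
               max_steps=1000, stop_threshold=0.5):
    """Hypothetical decoding loop: each step emits `reduction_factor`
    mel frames until the stop prediction fires or max_steps is reached."""
    frames = [prompt_mel]  # (1, T_prompt, n_mels) speaker prompt as acoustic context
    for _ in range(max_steps):
        mel_so_far = torch.cat(frames, dim=1)
        # One decoder step yields a chunk of r frames plus a stop logit (assumed API)
        new_frames, stop_logit = model.decode_step(text_ids, mel_so_far)
        frames.append(new_frames)  # (1, reduction_factor, n_mels)
        if torch.sigmoid(stop_logit).item() > stop_threshold:
            break
    mel = torch.cat(frames[1:], dim=1)   # drop the prompt frames
    return model.postnet_refine(mel)     # post-net refinement (assumed API)
```

A larger reduction factor shortens the decoding loop roughly proportionally, which is where the efficiency gain mentioned above comes from, at some potential cost in fine temporal detail.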