The paper introduces MELLE, a novel language modeling approach for text-to-speech synthesis (TTS) based on continuous-valued tokens. Unlike traditional methods that operate on discrete tokens produced by vector quantization, MELLE directly predicts continuous mel-spectrogram frames from text and acoustic prompts using a single-stage decoder-only model. This bypasses vector quantization, which often sacrifices fidelity relative to mel-spectrograms. Key contributions include:
1. **Training Objective**: Instead of cross-entropy loss, MELLE is trained with regression losses, augmented by a spectrogram flux loss, to model the distribution of continuous-valued tokens.
2. **Latent Sampling Module**: Incorporated into MELLE to enhance output diversity and robustness through variational inference.
3. **Efficiency**: MELLE can predict multiple mel-spectrogram frames per decoding step, reducing inference time and computational cost.
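To make the training objective and latent sampling concrete, the sketch below shows reparameterized sampling of a mel frame plus a regression loss and a variation-encouraging flux term. All names, shapes, weightings, and the exact flux formulation here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_sample(hidden, w_mu, w_logvar):
    """Reparameterization trick: z = mu + sigma * eps (illustrative only)."""
    mu = hidden @ w_mu
    logvar = hidden @ w_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps, mu, logvar

def regression_loss(pred, target):
    """Combined L1 + L2 regression loss on mel frames,
    replacing the cross-entropy used over discrete tokens."""
    return np.abs(pred - target).mean() + ((pred - target) ** 2).mean()

def flux_penalty(pred):
    """Reward frame-to-frame variation by penalizing small deltas
    (a stand-in for the paper's spectrogram flux loss)."""
    deltas = np.abs(np.diff(pred, axis=0))
    return -deltas.mean()  # minimizing this favors more dynamic output

# Toy shapes: T decoder steps, d hidden dims, n_mels mel bins
T, d, n_mels = 8, 16, 10
hidden = rng.standard_normal((T, d))
w_mu = rng.standard_normal((d, n_mels)) * 0.1
w_logvar = rng.standard_normal((d, n_mels)) * 0.1
target = rng.standard_normal((T, n_mels))

z, mu, logvar = latent_sample(hidden, w_mu, w_logvar)
loss = regression_loss(z, target) + 0.5 * flux_penalty(z)
```

The sampling step is what injects diversity at inference: two runs with the same hidden states yield different but plausible frames, whereas a purely deterministic regression head would always produce the same output.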
Experiments on the Libriheavy and LibriSpeech datasets show that MELLE outperforms existing methods like VALL-E and its variants in terms of objective metrics such as word error rate (WER) and subjective metrics such as mean opinion score (MOS) and speaker similarity (SIM). MELLE also demonstrates superior robustness and naturalness in synthesized speech, achieving comparable or better performance in various metrics. The paper discusses limitations and broader impacts, including the potential for misuse and the need for further research in multi-lingual settings and other continuous representations.