ELLA-V is a simple yet efficient language model-based zero-shot text-to-speech (TTS) framework that enables fine-grained control over synthesized audio at the phoneme level. The key innovation of ELLA-V is the interleaving of acoustic and phoneme tokens, where phoneme tokens appear before their corresponding acoustic tokens. This approach improves alignment between audio and phoneme sequences, leading to more accurate and stable synthesis. Experimental results show that ELLA-V outperforms VALL-E in accuracy and stability under both greedy and sampling-based decoding strategies. The ELLA-V code will be open-sourced after cleanup. Audio samples are available at https://ereboas.github.io/ELLAV/.
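The interleaving idea can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the `PH`/`AC`/`EOP`/`EOS` tag names and the assumption that a forced aligner supplies per-phoneme acoustic frames are ours.

```python
# Hypothetical sketch of ELLA-V-style interleaving: each phoneme token is
# placed immediately before the acoustic (codec) tokens aligned to it, and an
# end-of-phoneme (EOP) marker closes each phoneme's span.

def interleave(phonemes, aligned_acoustic_tokens):
    """phonemes: list of phoneme ids; aligned_acoustic_tokens: one list of
    codec-frame tokens per phoneme (e.g. from a forced aligner)."""
    seq = []
    for ph, frames in zip(phonemes, aligned_acoustic_tokens):
        seq.append(("PH", ph))                 # phoneme token precedes its audio
        seq.extend(("AC", a) for a in frames)  # acoustic tokens for this phoneme
        seq.append(("EOP", None))              # end-of-phoneme marker
    seq.append(("EOS", None))                  # end of sentence
    return seq

# Example: two phonemes with 2 and 1 aligned codec frames.
demo = interleave(["HH", "AY"], [[101, 102], [205]])
```

Because every acoustic token is locally conditioned on the phoneme token just before it, the model never has to infer the alignment over a long distance, which is what makes the output more stable than a plain text-then-audio concatenation.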
The paper discusses the limitations of existing zero-shot TTS methods, such as misalignment between text and audio, difficulty with fine-grained control, and the generation of infinite silence. ELLA-V addresses these issues with a generalized autoregressive (GAR) language model that generates the first layer of residual vector quantizer (RVQ) codes of a neural codec model, followed by a non-autoregressive (NAR) language model that generates the remaining RVQ codes. The core innovations of ELLA-V are: inserting phoneme tokens at the corresponding positions of the acoustic sequence, computing loss only on acoustic tokens and the special tokens (EOP and EOS), and introducing local advance, which shifts the EOP token and the next-word phoneme token a few frames ahead.
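The last two innovations can be sketched together. This is a minimal illustration under our own assumptions (tag names, a hypothetical advance of `k=2` frames), not the paper's implementation: local advance is shown by building the sequence with the EOP and next phoneme token placed `k` frames before the current phoneme's acoustic span ends, and the loss mask zeroes out phoneme-token positions.

```python
# Illustrative sketch of two ELLA-V training details: local advance and the
# loss mask. Token tags and k are assumptions for demonstration.

def interleave_with_advance(phonemes, frames_per_phoneme, k=2):
    """Place each phoneme's EOP token and the next phoneme token k frames
    before the end of the current phoneme's acoustic frames (local advance)."""
    seq = []
    n = len(phonemes)
    for idx in range(n):
        frames = frames_per_phoneme[idx]
        cut = max(0, len(frames) - k)            # advance point
        if idx == 0:
            seq.append(("PH", phonemes[idx]))    # first phoneme opens the sequence
        seq.extend(("AC", a) for a in frames[:cut])
        seq.append(("EOP", None))                # EOP shifted k frames ahead
        if idx + 1 < n:
            seq.append(("PH", phonemes[idx + 1]))  # next phoneme, also advanced
        seq.extend(("AC", a) for a in frames[cut:])
    seq.append(("EOS", None))
    return seq

def loss_mask(seq):
    # Loss is computed only on acoustic and special (EOP/EOS) tokens,
    # never on phoneme tokens, which are given as conditioning.
    return [0 if kind == "PH" else 1 for kind, _ in seq]

demo = interleave_with_advance(["HH", "AY"], [[1, 2, 3], [4, 5]], k=2)
```

Advancing the EOP and the next phoneme token gives the model an early signal that the current phoneme is about to end, which is what lets it truncate abnormally long phonemes instead of looping.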
The paper also presents an ablation study investigating the impact of global and local phoneme information on synthesized speech. The results show that ELLA-V achieves higher accuracy and more stable output across different top-p thresholds for nucleus sampling. The model is capable of generating EOP and promptly truncating abnormally long phonemes, avoiding infinite silence. ELLA-V also performs better on the zero-shot cross-speaker TTS task, with significantly lower WER than VALL-E, and is more robust to decoding strategies, showing less sensitivity to changes in the top-p sampling threshold. Overall, ELLA-V improves the synthesis accuracy and robustness of the language model-based TTS framework without affecting naturalness or speaker similarity.
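For context, top-p (nucleus) sampling, the decoding strategy whose threshold the robustness comparison varies, restricts sampling to the smallest set of tokens whose cumulative probability reaches p. A minimal sketch, with an illustrative distribution of our own choosing:

```python
# Minimal top-p (nucleus) sampling sketch; the probabilities below are
# illustrative, not from the paper.
import random

def top_p_sample(probs, p=0.9, rng=random):
    # Keep the smallest set of tokens whose cumulative probability >= p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, total = [], 0.0
    for i in ranked:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    # Renormalize within the nucleus and sample.
    weights = [probs[i] / total for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

Lower p makes decoding more greedy and higher p more diverse; a model whose WER stays flat as p varies is robust in the sense the paper measures.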