The paper introduces ELLA-V, a novel zero-shot text-to-speech (TTS) framework that addresses the limitations of existing methods such as VALL-E. ELLA-V aims to improve the accuracy and stability of synthesized speech by enabling fine-grained control at the phoneme level. The key innovation is the interleaving of acoustic and phoneme tokens, where each phoneme token is placed immediately before its corresponding acoustic tokens; this ordering helps the language model capture the alignment between the audio and phoneme modalities more effectively.

ELLA-V employs a generalized autoregressive (GAR) language model to predict the first layer of residual vector quantizer (RVQ) codes and a non-autoregressive (NAR) language model to predict the codes of the remaining layers. The training objective is to maximize the likelihood of the acoustic tokens together with the special tokens EndOfPhone (EOP) and EndOfSentence (EOS). During inference, ELLA-V uses a sampling-based decoding strategy for the GAR model and greedy decoding for the NAR model.

Experimental results show that ELLA-V outperforms VALL-E in both accuracy and stability, achieving a word error rate (WER) of 2.28% on the LibriSpeech test-clean set. ELLA-V also demonstrates better robustness on challenging synthesis tasks, such as cross-speaker synthesis, and avoids the infinite-silence failure mode more effectively. The code for ELLA-V will be open-sourced after cleanup, and audio samples are available online.
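To make the interleaving concrete, here is a minimal sketch of how such a sequence could be assembled, assuming a per-phoneme alignment of first-layer RVQ codes is available (e.g., from a forced aligner). The EOP and EOS token names follow the paper; the function name, data layout, and code IDs are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of ELLA-V-style token interleaving.
EOP = "<EOP>"  # end-of-phone marker, emitted after each phoneme's acoustics
EOS = "<EOS>"  # end-of-sentence marker

def interleave(phonemes, acoustic_codes_per_phoneme):
    """Build the interleaved sequence: each phoneme token precedes the
    first-layer RVQ codes aligned to it, followed by an EOP marker."""
    seq = []
    for ph, codes in zip(phonemes, acoustic_codes_per_phoneme):
        seq.append(ph)      # phoneme token first ...
        seq.extend(codes)   # ... then its aligned acoustic tokens
        seq.append(EOP)     # ... then the end-of-phone marker
    seq.append(EOS)
    return seq

# Example with two phonemes and toy first-layer RVQ code IDs:
print(interleave(["HH", "AH"], [[101, 102], [205]]))
# ['HH', 101, 102, '<EOP>', 'AH', 205, '<EOP>', '<EOS>']
```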
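One plausible way to write the GAR training objective, with $\mathbf{x}$ the phoneme sequence, $y_t$ the $t$-th target token drawn from the first-layer RVQ codebook $\mathcal{C}_1$ extended with the special tokens, and $\theta$ the model parameters (the exact formulation in the paper may differ):

$$\max_\theta \; \sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t}, \mathbf{x}\right), \qquad y_t \in \mathcal{C}_1 \cup \{\text{EOP}, \text{EOS}\}.$$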
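The two decoding strategies can also be contrasted in a short sketch. Temperature sampling for the GAR step and per-frame argmax for the NAR layers are standard techniques consistent with the paper's description; the function names and tensor shapes below are assumptions for illustration.

```python
import torch

def sample_gar_step(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Sampling-based decoding for the GAR model: draw the next
    first-layer token from the temperature-scaled distribution
    instead of taking the argmax. `logits` has shape (vocab_size,)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

def greedy_nar_layer(logits: torch.Tensor) -> torch.Tensor:
    """Greedy decoding for the NAR model: it predicts a whole RVQ layer
    at once, so take the argmax at every frame in parallel.
    `logits` has shape (num_frames, vocab_size)."""
    return logits.argmax(dim=-1)
```

Sampling injects the diversity an autoregressive acoustic model needs to avoid repetitive output, while the NAR refinement layers are conditioned on the already-fixed first layer, so a deterministic argmax suffices there.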