This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that achieves human-level naturalness and state-of-the-art speaker similarity and intelligibility. E2 TTS converts the text input into a character sequence with filler tokens, and a flow-matching-based mel spectrogram generator is trained on the audio infilling task. Unlike previous methods, E2 TTS requires no additional components or complex techniques, which keeps it simple yet effective. This simplicity allows for flexible input representation, and several variants are proposed to improve usability during inference. E2 TTS matches or outperforms previous works, including Voicebox and NaturalSpeech 3, in zero-shot TTS capability. The paper also discusses the relationship between E2 TTS and Voicebox, highlighting how E2 TTS simplifies the model by eliminating the grapheme-to-phoneme converter, phoneme aligner, and duration model. Experimental results demonstrate the robustness and scalability of E2 TTS in both objective and subjective evaluations.
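
As an illustration of the input representation described above, the following minimal sketch shows one way a text input could be turned into a character sequence with filler tokens. It is not the authors' implementation. The filler symbol `<F>`, the function name `pad_with_filler`, and the choice to append fillers until the sequence matches a target number of mel frames are all assumptions made for this example.

```python
from typing import List

# Hypothetical filler symbol; the text above does not name the actual token.
FILLER = "<F>"


def pad_with_filler(text: str, num_frames: int, filler: str = FILLER) -> List[str]:
    """Split `text` into characters and append filler tokens up to `num_frames`.

    Assumption: the character sequence is extended to the length of the
    target mel spectrogram so that text and audio share one time axis.
    """
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text has more characters than target frames")
    # Append as many filler tokens as needed to reach the target length.
    return chars + [filler] * (num_frames - len(chars))


if __name__ == "__main__":
    tokens = pad_with_filler("hi", 5)
    print(tokens)
```

Under these assumptions no phoneme aligner or duration model is involved in preparing the input: the only quantity needed besides the text is the target length.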