FlashSpeech is an efficient zero-shot speech synthesis system that significantly reduces inference time while maintaining high audio quality and speaker similarity. It is built on a latent consistency model (LCM) and uses a novel adversarial consistency training approach, which allows it to be trained from scratch without a pre-trained diffusion model. A new prosody generator module increases the diversity of prosody, making the speech sound more natural. FlashSpeech can generate speech in one or two sampling steps while preserving high audio quality and similarity to the audio prompt.
Experimental results show that FlashSpeech is approximately 20 times faster than other zero-shot speech synthesis systems while maintaining comparable voice quality and speaker similarity. This efficiency comes from the latent consistency model and adversarial consistency training, which together reduce computational cost and training time. The system is also versatile: it efficiently handles voice conversion, speech editing, and diverse speech sampling in addition to zero-shot text-to-speech, making it an efficient and effective solution for large-scale speech synthesis. Audio samples are available at https://flashspeech.github.io/.
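To make the one- or two-step sampling concrete, the following is a minimal sketch of the generic multistep sampling procedure used with consistency models. It is not FlashSpeech's implementation. The names `toy_consistency_fn` and `consistency_sample`, the noise levels `SIGMA_MIN` and `SIGMA_MAX`, and the 1-D Gaussian "latent" distribution are all assumptions made for illustration; in the real system a trained network conditioned on the text and the audio prompt would take the place of the toy function.

```python
import math
import random

# All constants below are illustrative assumptions, not values from FlashSpeech.
SIGMA_MIN = 0.002   # smallest noise level (the "clean" end of the trajectory)
SIGMA_MAX = 80.0    # largest noise level (pure-noise end of the trajectory)
DATA_MEAN = 1.5     # toy latent distribution: N(DATA_MEAN, DATA_STD^2)
DATA_STD = 0.5


def toy_consistency_fn(x, t):
    """Stand-in for a trained consistency model.

    For 1-D Gaussian data the map from noise level t back to SIGMA_MIN
    along the probability-flow ODE has this closed form, so no training
    is needed for the sketch. It satisfies f(x, SIGMA_MIN) = x.
    """
    scale = math.sqrt(DATA_STD**2 + SIGMA_MIN**2) / math.sqrt(DATA_STD**2 + t**2)
    return [DATA_MEAN + (xi - DATA_MEAN) * scale for xi in x]


def consistency_sample(f, n, taus=(), rng=None):
    """Draw n samples with 1 + len(taus) evaluations of f."""
    rng = rng or random.Random(0)
    # Step 1: start from pure noise and map it to a clean sample in one call.
    x = [rng.gauss(0.0, SIGMA_MAX) for _ in range(n)]
    x = f(x, SIGMA_MAX)
    # Optional extra steps: re-noise to an intermediate level tau, denoise again.
    for tau in taus:
        noise_std = math.sqrt(tau**2 - SIGMA_MIN**2)
        x = [xi + rng.gauss(0.0, noise_std) for xi in x]
        x = f(x, tau)
    return x


one_step = consistency_sample(toy_consistency_fn, 2000)
two_step = consistency_sample(toy_consistency_fn, 2000, taus=(1.0,))
print("one-step mean:", sum(one_step) / len(one_step))
print("two-step mean:", sum(two_step) / len(two_step))
```

Each sample costs one model call in the one-step case and two in the two-step case. That is the source of the speedup over diffusion samplers, which need many iterative calls. The intermediate noise level `1.0` in the two-step call is an arbitrary choice for this sketch.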