Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

4 Jun 2024 | Seed Team, ByteDance*
Seed-TTS is a family of large-scale autoregressive text-to-speech (TTS) models designed to generate speech that is nearly indistinguishable from human speech. The model excels at speech in-context learning (ICL), achieving high speaker similarity and naturalness in both objective and subjective evaluations. Seed-TTS offers superior controllability over various speech attributes, such as emotion, and can generate highly expressive and diverse speech for speakers in the wild. The paper introduces two novel extensions: self-distillation for speech factorization, and reinforcement learning (RL) for enhancing model robustness, speaker similarity, and controllability. Additionally, a non-autoregressive (NAR) variant, Seed-TTS$_{\text{DiT}}$, is proposed, which uses a fully diffusion-based architecture and performs end-to-end speech generation without pre-estimated phoneme durations. This variant achieves performance comparable to the language-model-based variants and is effective for speech editing. The paper also discusses potential applications and limitations of Seed-TTS, emphasizing the need for careful consideration of its societal impact.
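To make the zero-shot ICL setup concrete, the sketch below shows the general inference flow such autoregressive TTS systems follow: reference audio is encoded into discrete speech tokens, those tokens plus the text are used as context, and a language model autoregressively samples speech tokens that a decoder would later render to waveform. This is a minimal toy illustration under assumed names (`tokenize_speech`, `ToySpeechLM`, `generate_speech_tokens`); it is not the paper's actual API or architecture.

```python
import random

def tokenize_speech(audio, codebook_size=1024):
    """Stand-in for a neural speech tokenizer: map audio frames to discrete tokens."""
    return [hash(frame) % codebook_size for frame in audio]

class ToySpeechLM:
    """Stand-in autoregressive model that predicts the next speech token.
    A real model would condition on the full text/speech context."""
    def __init__(self, codebook_size=1024, seed=0):
        self.codebook_size = codebook_size
        self.rng = random.Random(seed)

    def next_token(self, context):
        # Placeholder sampling; a trained model scores the context here.
        return self.rng.randrange(self.codebook_size)

def generate_speech_tokens(model, prompt_audio, prompt_text, target_text, n_frames):
    # 1. Encode the reference speaker's audio into discrete speech tokens (the ICL prompt).
    context = tokenize_speech(prompt_audio)
    # 2. Append text conditioning (represented abstractly as characters).
    context = context + list(prompt_text + target_text)
    # 3. Autoregressively sample speech tokens for the target text.
    out = []
    for _ in range(n_frames):
        out.append(model.next_token(context + out))
    return out

tokens = generate_speech_tokens(ToySpeechLM(), prompt_audio=["frame1", "frame2"],
                                prompt_text="Hello.", target_text="How are you?",
                                n_frames=50)
print(len(tokens))  # 50 speech tokens, which a decoder would turn into audio
```

In the diffusion-based Seed-TTS$_{\text{DiT}}$ variant, step 3 would instead denoise all output frames jointly rather than sampling them one token at a time.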