Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

4 Jun 2024 | Seed Team, ByteDance
Seed-TTS is a family of large-scale autoregressive text-to-speech (TTS) models capable of generating speech that is virtually indistinguishable from human speech. It serves as a foundation model for speech generation and excels at speech in-context learning, achieving speaker similarity and naturalness that match ground-truth human speech in both objective and subjective evaluations. With fine-tuning, Seed-TTS achieves even higher subjective scores on these metrics.

Seed-TTS offers superior controllability over speech attributes such as emotion and can generate highly expressive and diverse speech for speakers in the wild. The paper additionally proposes a self-distillation method for speech factorization and a reinforcement learning (RL) approach that enhances model robustness, speaker similarity, and controllability.

The paper also introduces Seed-TTS_DiT, a non-autoregressive (NAR) variant built on a fully diffusion-based architecture. Unlike previous NAR TTS systems, Seed-TTS_DiT does not depend on pre-estimated phoneme durations and performs speech generation end to end. It achieves performance comparable to the language-model-based variant and proves effective for speech editing.

Seed-TTS itself is an autoregressive transformer-based model with four main building blocks: a speech tokenizer, a token language model, a token diffusion model, and an acoustic vocoder. It is trained on large amounts of data to enable strong generalization and emergent abilities, and it undergoes three training stages: pre-training, fine-tuning, and post-training. Pre-training aims to maximize scenario and speaker coverage while establishing a robust backbone for general speech modeling; fine-tuning consists of speaker fine-tuning and instruction fine-tuning; post-training is conducted through RL, which improves the model holistically.
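The sketch below shows one way these four components could be chained at inference time. Seed-TTS is not open source, so the class name, method signatures, and tensor interfaces here are illustrative assumptions only; they mirror the pipeline described above (speech tokenizer → token language model → token diffusion model → acoustic vocoder) rather than the authors' actual implementation.

```python
import torch

# Hypothetical component interfaces; everything below is an assumption made for
# illustration, not the real Seed-TTS API.

class SeedTTSPipeline:
    """Minimal sketch of the four-stage generation flow described in the paper."""

    def __init__(self, tokenizer, token_lm, token_diffusion, vocoder):
        self.tokenizer = tokenizer              # waveform -> discrete speech tokens
        self.token_lm = token_lm                # autoregressive transformer over text + speech tokens
        self.token_diffusion = token_diffusion  # speech tokens -> continuous acoustic features
        self.vocoder = vocoder                  # acoustic features -> waveform

    @torch.no_grad()
    def synthesize(self, text: str, prompt_wav: torch.Tensor) -> torch.Tensor:
        # 1) Tokenize the short audio prompt so the LM can condition on the target voice.
        #    This is the zero-shot in-context learning setup: no fine-tuning is needed.
        prompt_tokens = self.tokenizer.encode(prompt_wav)

        # 2) Autoregressively generate speech tokens for the new text, conditioned on the prompt.
        speech_tokens = self.token_lm.generate(text=text, prompt_tokens=prompt_tokens)

        # 3) Convert discrete tokens into continuous acoustic features via the diffusion model.
        acoustic_features = self.token_diffusion.sample(speech_tokens)

        # 4) Render the final waveform with the acoustic vocoder.
        return self.vocoder(acoustic_features)
```

Under this reading, speaker fine-tuning corresponds to adapting the token language model to a specific speaker before calling synthesize, while the zero-shot results rely only on the audio prompt passed at inference time.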
The paper presents experiments on zero-shot in-context learning, speaker fine-tuning, and emotion control. Seed-TTS closely matches real human speech for both English and Mandarin, with CMOS scores of -0.07 and -0.08 against ground-truth recordings, respectively, and the model also demonstrates strong performance across speech understanding and generation tasks.

The paper further discusses applications, limitations, and safety considerations, highlighting the potential of synthetic data for developing speech understanding models. The model's capabilities and limitations raise significant, novel challenges for multimedia and safety applications.

Key contributions include the introduction of Seed-TTS itself, a self-distillation extension for timbre disentanglement, an RL-based post-training extension, and a fully diffusion-based variant of Seed-TTS.
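As a closing note on the headline numbers above: CMOS (Comparative Mean Opinion Score) is typically collected by asking listeners to compare a synthesized sample against the matching human recording and rate their preference on a symmetric scale, then averaging the ratings. The exact protocol and scale used in the paper are not reproduced here, so the snippet below is only a minimal illustration with made-up ratings of why a score near zero (such as the reported -0.07 and -0.08) indicates near-indifference between synthetic and human speech.

```python
# Illustrative only: hypothetical listener ratings on a -3..+3 comparative scale,
# where negative values mean the human recording was preferred and 0 means no preference.
ratings = [0, 0, -1, 1, 0, 0, -1, 0, 0, 1, -1, 0]

cmos = sum(ratings) / len(ratings)
print(f"CMOS = {cmos:+.2f}")  # -> CMOS = -0.08, i.e. listeners are close to indifferent
```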