25 Jan 2024 | Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, Xipeng Qiu
SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation
SpeechGPT-Gen is a large-scale speech language model designed for both semantic and perceptual information modeling. It adopts Chain-of-Information Generation, which decouples the two kinds of information during speech generation: an autoregressive language model handles semantic modeling, while a non-autoregressive flow-matching model handles perceptual modeling. To make flow matching more efficient, SpeechGPT-Gen injects semantic information into the prior distribution rather than starting from an uninformative prior.

Trained on large-scale speech data and scaled to 8 billion parameters, the model delivers strong performance on zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue. Its contributions are fourfold: (1) the introduction of chain-of-information generation; (2) a large speech language model with strong semantic and perceptual modeling abilities; (3) improved flow-matching efficiency through semantic prior injection; and (4) the scaling of the speech generative model to 8 billion parameters.

Performance is evaluated on word error rate, speaker similarity, QMOS, SMOS, and ChatGPT score, and SpeechGPT-Gen outperforms baseline systems in speech quality, similarity, and coherence. Experiments comparing integrated generation, semantic-disentangled generation, and chain-of-information generation further show that chain-of-information generation is the most effective and efficient of the three.
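The semantic prior injection described above can be illustrated with a minimal flow-matching sketch. This is not the paper's implementation; all names and shapes are illustrative, and the "semantic prior" is simulated as a vector close to the target. The point it demonstrates is that a straight interpolation path starting from a semantically informed prior leaves a shorter velocity field for the model to regress than one starting from a Gaussian prior.

```python
import numpy as np

def flow_matching_pair(x1, prior, t):
    """Build one conditional flow-matching training pair on a straight path.

    x1:    target speech representation (data sample)
    prior: starting point of the path; vanilla flow matching draws this
           from N(0, I), while semantic prior injection (as summarized
           above) starts from a semantic representation of the utterance
    t:     interpolation time in [0, 1]
    """
    xt = (1.0 - t) * prior + t * x1   # point on the path at time t
    v_target = x1 - prior             # constant velocity the model regresses
    return xt, v_target

rng = np.random.default_rng(0)
x1 = rng.normal(size=16)                          # stand-in perceptual features
gaussian_prior = rng.normal(size=16)              # uninformative prior
semantic_prior = x1 + 0.1 * rng.normal(size=16)   # prior close to the target

_, v_gauss = flow_matching_pair(x1, gaussian_prior, 0.5)
_, v_sem = flow_matching_pair(x1, semantic_prior, 0.5)

# The semantically informed prior shortens the path the model must learn.
print(np.linalg.norm(v_sem) < np.linalg.norm(v_gauss))
```

In the full system, the semantic prior would come from the autoregressive language model's output for the same utterance, so the flow-matching stage only has to model the residual perceptual detail.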
Additionally, experiments across different model sizes and between continuous and discrete modeling approaches demonstrate the model's strong scalability and effectiveness.