25 Jan 2024 | Dong Zhang, Xin Zhang, Jun Zhan, Shimin Li, Yaqian Zhou, Xipeng Qiu
**Abstract:**
The paper introduces Chain-of-Information Generation (CoIG), a method that decouples semantic and perceptual information in large-scale speech generation. Building on CoIG, the authors develop SpeechGPT-Gen, an 8-billion-parameter Speech Large Language Model (SLLM) that efficiently models both semantic and perceptual information. SpeechGPT-Gen consists of an autoregressive model for semantic modeling and a non-autoregressive model using flow matching for perceptual modeling. The authors also propose a novel approach of infusing semantic information into the prior distribution to enhance the efficiency of flow matching. Extensive experiments demonstrate that SpeechGPT-Gen excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue tasks, highlighting the effectiveness of CoIG in capturing and modeling speech's semantic and perceptual dimensions.
**Contributions:**
1. Introduction of Chain-of-Information Generation (CoIG) for disentangled semantic and perceptual modeling in large-scale speech generation.
2. Development of SpeechGPT-Gen, a large SLLM with strong semantic and perceptual modeling capabilities.
3. Proposal of improving flow matching efficiency by injecting semantic information into the prior distribution.
4. Scaling up the model parameters to 8 billion, achieving remarkable performance in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
**Related Work:**
The paper reviews existing approaches in large-scale speech generation, including integrated and semantic-disentangled generation methods, and discusses the limitations of current models in terms of redundancy and inefficiency.
**SpeechGPT-Gen Architecture:**
- **SpeechTokenizer:** A Residual Vector Quantization (RVQ)-based speech tokenizer that hierarchically disentangles speech: the first quantizer layer captures semantic content, while the residual layers capture perceptual detail.
- **LLaMA2-7B-Chat:** Pre-trained LLM used to initialize the autoregressive model that generates semantic tokens.
- **Flow Matching for Perceptual Modeling:** Utilizes continuous normalizing flows to model the transformation from a simple prior distribution to a complex data distribution.
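The perceptual model is trained with a conditional flow-matching objective. Below is a minimal PyTorch sketch of such an objective, assuming a straight-line (optimal-transport) probability path and folding the semantic representation into the prior as the paper proposes; `PerceptualFlow`, its layer sizes, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of conditional flow matching for perceptual modeling
# (assumed PyTorch; not the authors' code).
# Shapes: x1 and semantic are (batch, frames, dim).
import torch
import torch.nn as nn

class PerceptualFlow(nn.Module):
    """Toy velocity-field network v_theta(x_t, t | semantic)."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, semantic):
        # Broadcast the scalar time step to every frame and concatenate the conditions.
        t_feat = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, semantic, t_feat], dim=-1))

def flow_matching_loss(model, x1, semantic, sigma: float = 1e-4):
    """Regress the velocity along a straight path from a prior sample x0 to
    the data x1. Here the prior is centred on the semantic representation
    (x0 ~ N(semantic, I)) instead of a standard Gaussian."""
    x0 = semantic + torch.randn_like(x1)            # semantic-informed prior sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    x_t = (1 - (1 - sigma) * t) * x0 + t * x1       # point on the probability path
    target_v = x1 - (1 - sigma) * x0                # constant target velocity
    return ((model(x_t, t, semantic) - target_v) ** 2).mean()
```

Because the prior already carries the content, the flow only needs to add perceptual detail on top of it, which is the intuition behind the paper's efficiency claim for the semantic prior.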
**Experiments:**
- **Zero-shot Text-to-Speech:** Achieves low Word Error Rate (WER) and high speaker similarity; a sketch of how these metrics are typically computed follows this list.
- **Zero-shot Voice Conversion:** Demonstrates low WER and high speaker similarity.
- **Speech-to-Speech Dialogue:** Shows high ChatGPT scores for response quality.
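WER and speaker similarity are standard evaluation metrics; the sketch below shows how they are typically computed, using the `jiwer` package for WER and cosine similarity between speaker embeddings for similarity. The `speaker_encoder` is a hypothetical stand-in for a speaker-verification model; the paper's exact evaluation pipeline may differ.

```python
# Illustrative metric computation (assumed tooling, not the paper's evaluation code).
import torch
import torch.nn.functional as F
import jiwer

def word_error_rate(reference_text: str, hypothesis_text: str) -> float:
    # WER between the target text and an ASR transcript of the generated speech.
    return jiwer.wer(reference_text, hypothesis_text)

def speaker_similarity(speaker_encoder, prompt_wav: torch.Tensor,
                       generated_wav: torch.Tensor) -> float:
    # Cosine similarity between speaker embeddings of the prompt and the output.
    emb_prompt = speaker_encoder(prompt_wav)        # hypothetical encoder -> (dim,)
    emb_generated = speaker_encoder(generated_wav)
    return F.cosine_similarity(emb_prompt, emb_generated, dim=-1).item()
```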
**Analysis:**
- **CoIG Effectiveness and Efficiency:** Compared with integrated and semantic-disentangled generation methods, CoIG shows faster training convergence and better downstream performance.
- **Semantic Prior in Flow Matching:** Injecting semantic information into the prior distribution improves flow matching and makes inference more efficient (see the inference sketch after this list).
- **Scalability:** Demonstrates strong scalability with larger model sizes and more training data.
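To illustrate the inference-efficiency point, here is a hedged Euler-integration sketch that reuses the hypothetical `PerceptualFlow` velocity model from the earlier block: sampling starts from the semantic-informed prior and integrates the learned flow over a small number of steps. The step count and the prior form are assumptions for illustration, not the authors' settings.

```python
# Illustrative Euler sampler for the flow-matching model
# (an assumption, not the authors' inference code).
import torch

@torch.no_grad()
def sample_perceptual(model, semantic, n_steps: int = 10):
    """Integrate dx/dt = v_theta(x, t | semantic) from t=0 to t=1.
    Starting from N(semantic, I) means the flow mainly has to add
    perceptual detail, so few steps can suffice."""
    x = semantic + torch.randn_like(semantic)        # semantic-informed prior sample
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1, 1), i * dt, device=x.device)
        x = x + dt * model(x, t, semantic)           # Euler step along the learned flow
    return x                                         # approximate perceptual features
```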
**Conclusion:**
SpeechGPT-Gen addresses the redundancy in current SLLMs by decoupling semantic and perceptual information, achieving superior performance in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.