SpeechAlign: Aligning Speech Generation to Human Preferences


8 Apr 2024 | Dong Zhang*, Zhaowei Li*, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou†, Xipeng Qiu†
This paper addresses the gap between the training and inference phases of neural codec language models: models are trained on golden codec tokens but operate on their own synthetic tokens at inference, and this distribution discrepancy negatively affects performance. The authors propose SpeechAlign, an iterative self-improvement strategy that aligns speech language models with human preferences. SpeechAlign constructs a preference codec dataset by contrasting golden codec tokens with synthetic tokens, then applies preference optimization to improve the codec language model; this cycle of improvement is repeated iteratively. Both subjective and objective evaluations show that SpeechAlign bridges the distribution gap and enables continuous self-improvement of the speech language model. SpeechAlign also exhibits robust generalization and works effectively even with smaller models. The code and models are available at https://github.com/0notation/SpeechGPT.

- **Background**: Introduces codec language models and the distribution gap between golden and synthetic codec tokens.
- **Preliminary Analysis**: Analyzes the distribution gap and its impact on the performance of the non-autoregressive (NAR) model.
- **SpeechAlign**: Describes the core method, including construction of the preference codec dataset and the preference optimization strategies considered (Chain-of-Hindsight, Direct Preference Optimization, RLHF-PPO, and Best-of-N Sampling); hedged code sketches of the iterative cycle and the DPO objective follow this outline.
- **Experiments**: Reports the setup, evaluation metrics, and main results, showing that SpeechAlign significantly improves speech generation quality and speaker similarity.
- **Ablation Studies**: Explores the effect of different training data sizes, compares against continued supervised fine-tuning, and shows that SpeechAlign remains effective with small models.
- **Conclusion**: Highlights the effectiveness of SpeechAlign in aligning speech language models with human preferences and its potential for continuous self-improvement.
- **Fine-grained Reward Signals from Real-World Human Preferences**: The paper suggests that capturing more detailed aspects of human preferences could further enhance speech generation capabilities.
- **Preference Optimization of the NAR Models**: Applying preference optimization to calibrate the output distribution of the NAR models is recommended for further exploration.
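To make the iterative cycle described above concrete, the sketch below outlines one SpeechAlign-style round: build preference pairs in which golden codec tokens are preferred over the model's own synthetic tokens, then run a preference-optimization update. This is a minimal illustration, not the paper's implementation; `codec_tokenizer.encode`, `policy.generate`, `dpo_update`, and the refreshed reference model are hypothetical placeholders.

```python
import copy

def speechalign_round(policy, ref_model, codec_tokenizer, prompts, golden_wavs, dpo_update):
    """One illustrative SpeechAlign-style round: collect preference pairs in which
    golden codec tokens (from real speech) are preferred over the model's own
    synthetic tokens, then run a preference-optimization update (e.g. DPO)."""
    pairs = []
    for prompt, wav in zip(prompts, golden_wavs):
        chosen = codec_tokenizer.encode(wav)   # golden codec tokens from ground-truth speech
        rejected = policy.generate(prompt)     # synthetic codec tokens from the current model
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    policy = dpo_update(policy, ref_model, pairs)   # preference optimization on the pairs
    # Assumption for illustration: the next round's reference is the newly improved policy.
    return policy, copy.deepcopy(policy)
```

Repeating this round yields the iterative self-improvement loop described in the paper, with a fresh preference dataset rebuilt from the improved model at each step.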
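Of the optimization strategies listed in the outline, Direct Preference Optimization is the most self-contained to illustrate. Below is a minimal PyTorch sketch of the standard DPO objective applied to codec-token sequences; it is a generic illustration rather than the paper's code, and the Hugging Face-style `.logits` interface assumed in `sequence_logprob` is an assumption.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids):
    """Summed log-probability of a codec-token sequence under a causal LM.
    Assumes `model(input_ids)` returns an object with a `.logits` tensor of
    shape (batch, seq_len, vocab)."""
    logits = model(input_ids).logits[:, :-1, :]
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps.sum(dim=-1)

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: increase the margin by which the policy prefers
    golden (chosen) codec tokens over synthetic (rejected) ones, measured
    relative to a frozen reference model."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
```

In this framing the "chosen" sequences are golden codec tokens and the "rejected" sequences are the model's synthetic tokens, so no separately trained reward model is required.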