SpeechAlign: Aligning Speech Generation to Human Preferences

8 Apr 2024 | Dong Zhang, Zhaowei Li, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
SpeechAlign is a method for aligning speech language models with human preferences by learning from human feedback. The paper addresses the distribution gap in codec language models: the mismatch between the golden codec tokens seen during training and the model-generated tokens used at inference, which degrades performance. SpeechAlign introduces an iterative self-improvement strategy that first constructs a preference codec dataset contrasting golden codec tokens with synthetic tokens, then applies preference optimization to improve the codec language model. Repeating this cycle steadily converts weak models into strong ones.

Through both subjective and objective evaluations, the paper shows that SpeechAlign bridges the distribution gap and enables continuous self-improvement of the speech language model. SpeechAlign also exhibits robust generalization and remains effective for smaller models. The paper explores several preference optimization strategies, including Chain-of-Hindsight, Direct Preference Optimization (DPO), RLHF-PPO, and Best-of-N Sampling, and finds that preference optimization improves both the accuracy of content modeling and the effectiveness of timbre modeling. SpeechAlign works for both large and small models and generalizes to unseen speakers. The paper also discusses the limitations of current methods and directions for future work.
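To make the preference-optimization step concrete, below is a minimal sketch of how a DPO-style objective could be applied to a preference codec pair, where the golden codec tokens are treated as the "chosen" response and the model's own synthetic tokens as the "rejected" one. The tensor shapes, function names, and dummy data are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a DPO loss over a preference codec pair (golden vs. synthetic tokens).
# Assumptions: a generic autoregressive codec LM that emits next-token logits,
# and dummy tensors standing in for real codec sequences.
import torch
import torch.nn.functional as F


def sequence_logprob(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `tokens` under `logits`.

    logits: (batch, seq_len, vocab) next-token logits from the codec LM
    tokens: (batch, seq_len) codec token ids actually taken
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
    return token_logp.sum(dim=-1)  # (batch,)


def dpo_loss(policy_chosen_logits, policy_rejected_logits,
             ref_chosen_logits, ref_rejected_logits,
             chosen_tokens, rejected_tokens, beta: float = 0.1) -> torch.Tensor:
    """DPO objective: push the policy to prefer golden codec tokens (chosen)
    over its own synthetic tokens (rejected), relative to a frozen reference."""
    pi_chosen = sequence_logprob(policy_chosen_logits, chosen_tokens)
    pi_rejected = sequence_logprob(policy_rejected_logits, rejected_tokens)
    ref_chosen = sequence_logprob(ref_chosen_logits, chosen_tokens)
    ref_rejected = sequence_logprob(ref_rejected_logits, rejected_tokens)

    # Margin: how much more the policy (vs. the reference) prefers golden tokens.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()


if __name__ == "__main__":
    # Dummy shapes: 2 preference pairs, 8 codec frames, vocabulary of 1024 codec tokens.
    B, T, V = 2, 8, 1024
    chosen = torch.randint(0, V, (B, T))     # golden codec tokens
    rejected = torch.randint(0, V, (B, T))   # model-generated (synthetic) tokens
    loss = dpo_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                    torch.randn(B, T, V), torch.randn(B, T, V),
                    chosen, rejected)
    print(loss)
```

In the iterative scheme described above, each round would regenerate synthetic codec tokens with the current model, rebuild the preference dataset against the golden tokens, and rerun this optimization step.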