13 Jun 2024 | Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin
Chain of Preference Optimization (CPO) improves the reasoning ability of large language models (LLMs) by leveraging the preference information generated during tree-of-thought (ToT) search. Unlike previous methods that fine-tune only on the complete reasoning path, CPO uses the preferences over thoughts produced at each reasoning step, information that prior works typically discard. This lets LLMs learn from the preference signal inherent in the tree-search process and improves reasoning performance without increasing inference latency. Concretely, CPO constructs paired preferred and dispreferred thoughts from the ToT search tree and trains the LLM to align with these step-level preferences using the Direct Preference Optimization (DPO) algorithm.

Extensive experiments on seven datasets with LLaMA and Mistral as base models show that CPO significantly improves performance on complex tasks such as question answering, fact verification, and arithmetic reasoning, achieving average accuracy gains of up to 4.3% over the base models. CPO also matches or exceeds the ToT method itself, which requires substantially longer inference time. Because the preference data comes from the model's own search, CPO needs no human-annotated data: the LLM learns from its own feedback. And unlike prior works that construct feedback only over complete solutions, CPO builds feedback in a chain fashion over individual reasoning steps, an aspect those works overlook.
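To make the step-level pairing and DPO training concrete, here is a minimal sketch under assumed data structures. The helper names (build_step_pairs, dpo_loss, toy_tree) and the toy search-tree format are illustrative, not the authors' implementation; in practice the log-probabilities would come from the policy model and a frozen reference model scoring each (prefix, thought) continuation.

```python
# Sketch of CPO-style step-level preference pairing plus the standard DPO loss.
# Hypothetical helpers and data layout; not the paper's released code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective applied to per-step thought pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def build_step_pairs(tree):
    """Pair each thought kept by the ToT search ('chosen') with its pruned
    siblings at the same reasoning step."""
    pairs = []
    for step in tree:  # each step: {"prefix": ..., "chosen": ..., "pruned": [...]}
        for bad in step["pruned"]:
            pairs.append((step["prefix"], step["chosen"], bad))
    return pairs

# Toy two-step search tree (contents are placeholders).
toy_tree = [
    {"prefix": "Q: ...", "chosen": "Thought A1", "pruned": ["Thought A2", "Thought A3"]},
    {"prefix": "Q: ... Thought A1", "chosen": "Thought B1", "pruned": ["Thought B2"]},
]
pairs = build_step_pairs(toy_tree)

# Stand-in log-probabilities so the sketch runs end to end.
n = len(pairs)
loss = dpo_loss(torch.randn(n, requires_grad=True), torch.randn(n),
                torch.randn(n), torch.randn(n))
loss.backward()
print(f"{n} step-level preference pairs, DPO loss = {loss.item():.4f}")
```

The key design point the sketch illustrates is that preference pairs are formed per reasoning step from the search tree, rather than only over whole reasoning paths.

CPO is robust across data settings, improves performance on various data types, and mitigates the effect of dispreferred thoughts on model performance. By leveraging the model's own reasoning process and requiring no human annotation, it offers an efficient way to strengthen LLM reasoning and is applicable across a wide range of domains.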