13 Jun 2024 | Xuan Zhang, Chao Du, Tianyu Pang, Qian Liu, Wei Gao, Min Lin
Chain of Preference Optimization (CPO) improves the reasoning ability of large language models (LLMs) by leveraging the preference information generated during tree-of-thought (ToT) search. Unlike previous methods that fine-tune only on the complete reasoning path, CPO uses the preferences over thoughts produced at each reasoning step, information that prior works typically discard. This lets LLMs learn from the preference signal inherent in the tree-search process and improves reasoning performance without increasing inference latency. Concretely, CPO constructs paired preferred and dispreferred thoughts from the ToT search tree and trains the LLM to align with these step-level preferences using the Direct Preference Optimization (DPO) algorithm.

Extensive experiments on seven datasets with LLaMA and Mistral as base models show that CPO significantly improves performance on complex tasks such as question answering, fact verification, and arithmetic reasoning, achieving average accuracy gains of up to 4.3% over the base models. CPO also matches or exceeds the ToT method itself, which requires substantially longer inference time. Because the preference data comes from the model's own search, CPO needs no human-annotated data: the LLM learns from its own feedback. And unlike prior works that construct feedback only over complete solutions, CPO builds feedback in a chain fashion over individual reasoning steps, an aspect those works overlook.
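To make the step-level pairing and DPO training concrete, here is a minimal sketch under assumed data structures. The helper names (build_step_pairs, dpo_loss, toy_tree) and the toy search-tree format are illustrative, not the authors' implementation; in practice the log-probabilities would come from the policy model and a frozen reference model scoring each (prefix, thought) continuation.

```python
# Sketch of CPO-style step-level preference pairing plus the standard DPO loss.
# Hypothetical helpers and data layout; not the paper's released code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective applied to per-step thought pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def build_step_pairs(tree):
    """Pair each thought kept by the ToT search ('chosen') with its pruned
    siblings at the same reasoning step."""
    pairs = []
    for step in tree:  # each step: {"prefix": ..., "chosen": ..., "pruned": [...]}
        for bad in step["pruned"]:
            pairs.append((step["prefix"], step["chosen"], bad))
    return pairs

# Toy two-step search tree (contents are placeholders).
toy_tree = [
    {"prefix": "Q: ...", "chosen": "Thought A1", "pruned": ["Thought A2", "Thought A3"]},
    {"prefix": "Q: ... Thought A1", "chosen": "Thought B1", "pruned": ["Thought B2"]},
]
pairs = build_step_pairs(toy_tree)

# Stand-in log-probabilities so the sketch runs end to end.
n = len(pairs)
loss = dpo_loss(torch.randn(n, requires_grad=True), torch.randn(n),
                torch.randn(n), torch.randn(n))
loss.backward()
print(f"{n} step-level preference pairs, DPO loss = {loss.item():.4f}")
```

The key design point the sketch illustrates is that preference pairs are formed per reasoning step from the search tree, rather than only over whole reasoning paths.

CPO is robust across data settings, improves performance on various data types, and mitigates the effect of dispreferred thoughts on model performance. By leveraging the model's own reasoning process and requiring no human annotation, it offers an efficient way to strengthen LLM reasoning and is applicable across a wide range of domains.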