22 Jul 2024 | Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, Bo An
The paper introduces Q*, a framework designed to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). Because LLMs generate text auto-regressively, they often produce errors and hallucinations on complex reasoning tasks. Q* frames multi-step reasoning as a heuristic search problem and uses a plug-and-play Q-value model to guide the LLM toward the most promising next step, without fine-tuning the LLM for each task. Because the underlying LLM is left unchanged, this avoids the computational cost of fine-tuning and the risk of degrading performance on other tasks. The optimal Q-values are estimated using offline reinforcement learning, rollouts, and stronger LLMs. Evaluated on GSM8K, MATH, and MBPP, Q* shows significant improvements in multi-step reasoning over existing methods, covering both math word problems and code generation.
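To make the idea of Q-value-guided search concrete, the following is a minimal sketch of a best-first search over reasoning steps. It is not the paper's implementation: the names `best_first_search`, `propose`, `q_value`, and `apply_steps`, the toy arithmetic task, and the rule of ranking states purely by a hand-written Q estimate are all assumptions made for illustration. In a real setting, `propose` would be an LLM sampling candidate next steps and `q_value` would be a learned model.

```python
import heapq
from typing import Callable, List, Tuple

State = Tuple[str, ...]  # a partial reasoning trace: the steps taken so far


def best_first_search(
    propose: Callable[[State], List[str]],
    q_value: Callable[[State, str], float],
    is_terminal: Callable[[State], bool],
    max_expansions: int = 100,
) -> State:
    """Repeatedly expand the open state with the highest estimated Q-value."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # in insertion order and keeps states from being compared directly.
    frontier = [(0.0, 0, ())]
    counter = 1
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)
        if is_terminal(state):
            return state
        for step in propose(state):
            score = q_value(state, step)
            heapq.heappush(frontier, (-score, counter, state + (step,)))
            counter += 1
    return ()  # no terminal state found within the expansion budget


# Toy task (hypothetical): start from 1 and reach TARGET using the steps below.
TARGET = 10
STEPS = ["+1", "*2", "+3"]


def apply_steps(state: State) -> int:
    """Evaluate a trace of steps starting from the value 1."""
    x = 1
    for step in state:
        x = x * int(step[1:]) if step[0] == "*" else x + int(step[1:])
    return x


def propose(state: State) -> List[str]:
    """Stand-in for an LLM proposing candidate next steps."""
    return STEPS


def q_value(state: State, step: str) -> float:
    """Stand-in for a learned Q-value model: closer to TARGET is better."""
    return -abs(TARGET - apply_steps(state + (step,)))


def is_terminal(state: State) -> bool:
    return apply_steps(state) == TARGET


result = best_first_search(propose, q_value, is_terminal)
print(result, "->", apply_steps(result))
```

The point of the sketch is the division of labor: the step generator is never modified, and all guidance comes from the separate scoring function, which is what makes the scorer swappable across tasks.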