22 Jul 2024 | Chaojie Wang, Yanchen Deng, Zhiyi Lyu, Liang Zeng, Jujie He, Shuicheng Yan, Bo An
This paper introduces Q*, a general, versatile, and agile deliberative planning framework that improves multi-step reasoning in large language models (LLMs). Because of their auto-regressive generation process, LLMs are prone to errors, hallucinations, and inconsistent statements when reasoning over many steps. Q* addresses this by casting multi-step reasoning as a heuristic search problem, using a Q-value model as the heuristic that guides the LLM toward the most promising next reasoning step, without any fine-tuning of the LLM itself. This sidesteps the heavy computational cost of fine-tuning and the risk of degrading performance on other tasks.
The Q* framework is built on A* search: the f-value of each state is a weighted sum of the accumulated utility of the partial reasoning trajectory and a heuristic value, where the heuristic is the estimated optimal Q-value of the state. Q* estimates these optimal Q-values in three ways: offline reinforcement learning, rollouts, and completion with a stronger LLM. Because the Q-value model provides the guidance, the most promising next step can be selected without modifying the LLM's parameters. A minimal sketch of this search loop is given below.
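The sketch below illustrates the A*-style scoring described above in a best-first search loop, assuming a weighting of the form f(s) = λ·g(s) + h(s), with g the accumulated utility and h the estimated optimal Q-value (where exactly λ is placed is an assumption here). The `propose`, `q_value`, and `utility` callables are placeholders for the LLM policy, the Q-value model, and the utility function; they are not interfaces from the paper.

```python
import heapq
import itertools
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the paper):
ProposeFn = Callable[[str], List[str]]   # partial trajectory -> candidate next steps
QValueFn = Callable[[str, str], float]   # (trajectory, candidate step) -> estimated Q-value
UtilityFn = Callable[[str], float]       # trajectory -> accumulated utility g(s)

def q_star_search(question: str,
                  propose: ProposeFn,
                  q_value: QValueFn,
                  utility: UtilityFn,
                  lam: float = 1.0,
                  max_expansions: int = 50) -> str:
    """Best-first (A*-style) search over reasoning trajectories.

    A state is the question plus the steps generated so far. States are
    expanded in decreasing order of f(s) = lam * g(s) + h(s), a sketch of
    the weighted-sum scoring described above, not the paper's exact code.
    """
    counter = itertools.count()  # tie-breaker so heapq never compares strings
    frontier: List[Tuple[float, int, str]] = [(0.0, next(counter), question)]

    best_state = question
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)
        best_state = state
        candidates = propose(state)
        if not candidates:  # no further steps proposed: treat as a terminal answer
            return state
        for step in candidates:
            child = state + "\n" + step
            # f = lam * accumulated utility + Q-value heuristic for taking this step
            f = lam * utility(child) + q_value(state, step)
            heapq.heappush(frontier, (-f, next(counter), child))  # max-first via negation
    return best_state
```

In practice the terminal check would inspect the generated text for a complete answer rather than rely on the proposer returning nothing; the loop structure is otherwise the standard best-first expansion that A* prescribes.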
Extensive experiments on GSM8K, MATH, and MBPP show that Q* substantially improves the multi-step reasoning of existing open-source LLMs. On GSM8K, Q* reaches 80.8% accuracy, surpassing ChatGPT-turbo. On MATH, Q* improves both Llama-2-7b and DeepSeek-Math-7b, with the latter reaching 55.4% accuracy. On MBPP, Q* helps CodeQwen1.5-7b-Chat reach 77.0% accuracy.
Q* is efficient because it only considers a single step at a time, making it much cheaper than complete rollouts in MCTS-based methods. The framework is applicable to various reasoning tasks without modification, making it a versatile solution for improving LLM reasoning capabilities.
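To make the cost argument concrete, here is a minimal sketch (hypothetical interfaces, not the paper's code) contrasting per-step scoring with a learned Q-value model against rollout-based scoring: the former needs one value-model call per candidate step, while the latter needs several full LLM completions per candidate.

```python
from typing import Callable, List

# Stand-in types for illustration only; none of these names come from the paper.
QModel = Callable[[str, str], float]   # (trajectory, candidate step) -> value
PolicyLLM = Callable[[str], str]       # partial solution -> completed solution
RewardFn = Callable[[str], float]      # completed solution -> terminal reward

def score_steps_single_pass(trajectory: str, steps: List[str],
                            q_model: QModel) -> List[float]:
    """Q*-style scoring: one Q-value evaluation per candidate step."""
    return [q_model(trajectory, s) for s in steps]

def score_steps_by_rollout(trajectory: str, steps: List[str],
                           policy: PolicyLLM, reward: RewardFn,
                           rollouts_per_step: int = 8) -> List[float]:
    """MCTS-style scoring: several full completions per candidate step,
    averaging their terminal rewards -- far more generation per decision."""
    scores = []
    for s in steps:
        total = sum(reward(policy(trajectory + "\n" + s))
                    for _ in range(rollouts_per_step))
        scores.append(total / rollouts_per_step)
    return scores
```

Swapping between the two only changes how a candidate step is scored; the surrounding search loop stays the same, which is what lets Q* keep the per-decision cost low.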