Can LLMs Reason in the Wild with Programs?


19 Jun 2024 | Yuan Yang¹, Siheng Xiong¹, Ali Payani², Ehsan Shareghi³ & Faramarz Fekri¹
This paper introduces the task of reasoning in the wild, where an LLM is tasked to solve a reasoning problem of unknown type by identifying the sub-problems and their corresponding formalisms, and writing a program to solve each sub-problem, guided by a tactic. We create a large tactic-guided trajectory dataset, ReWild, containing detailed solutions to a diverse set of reasoning problems, ranging from well-defined single-form reasoning to ambiguous and hybrid ones. This allows us to test various aspects of LLMs' reasoning at a fine-grained level, such as the selection and execution of tactics and the tendency to take undesired shortcuts. In experiments, we highlight that existing LLMs fail significantly on problems with ambiguous and mixed scope, revealing critical limitations and overfitting issues. We further show that fine-tuning a local LLM on the trajectory data leads to better performance.

Concretely, reasoning in the wild involves solving reasoning problems by writing programs and interacting with an environment defined by a tactic. We demonstrate how to set up a unified reasoning framework that tackles both existing and new reasoning problems, and how to incorporate various mechanisms into the reasoning process that enable a more fine-grained analysis of a system's abilities. We also show how to evaluate such complex trajectories to gain deeper insight into LLMs' limitations and behavior beyond common holistic evaluation protocols. We evaluate a diverse set of the most powerful LLMs on our benchmark and find that they fail significantly on problems requiring tactic-guided programs, and that performance deteriorates further on hybrid problems. Through the lens of tactics, we analyze the results and identify three critical limitations of existing LLMs: (1) many LLMs show “overfitted” behavior and fail to follow the tactic on popular problems such as GSM8K, leading to a drop in performance; (2) most LLMs, except for the GPT4 series, lack the capability of instruction-following in long context, failing to follow the tactic on trajectories that are typically 3K tokens long; (3) powerful LLMs, including GPT4, tend to hallucinate and generate “trivial programs” on ambiguous reasoning problems, showing poor generalizability to out-of-distribution problems. Finally, we show that these limitations can be alleviated via fine-tuning: we train and release a LLaMA3-8B model on ReWild, which we refer to as TactIc-Guided ReasonER (TIGER-8B), and show that it achieves GPT4-level performance. We also evaluate LLMs on hybrid problems and find that they struggle with the difficult ones, indicating that hybrid problems remain a highly challenging and valuable benchmark that provides deep insight into LLMs' reasoning capability.
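To make the "reasoning with programs" setting concrete, the following is a minimal, hypothetical sketch of how a tactic might guide an LLM-written program for one sub-problem. The function and variable names (`arithmetic_tactic`, `facts`) are illustrative assumptions, not the paper's actual framework or API.

```python
# Hypothetical sketch of tactic-guided reasoning with a program.
# A "tactic" here prescribes how a sub-problem is formalized; the solver
# then emits a small program and executes it in an environment.

def arithmetic_tactic(facts):
    """Tactic (assumed): formalize an arithmetic word problem as
    straight-line assignments, then execute the generated program."""
    program = "\n".join(f"{name} = {expr}" for name, expr in facts)
    env = {}
    exec(program, {}, env)  # run the generated program, collect bindings
    return env

# GSM8K-style sub-problem: "A box holds 12 eggs. There are 3 boxes and
# 7 eggs have been eaten. How many eggs remain?"
facts = [
    ("eggs_per_box", "12"),
    ("boxes", "3"),
    ("eaten", "7"),
    ("remaining", "eggs_per_box * boxes - eaten"),
]
result = arithmetic_tactic(facts)
print(result["remaining"])  # 29
```

A deductive or constraint sub-problem would use a different tactic (e.g., emitting logic-programming clauses instead of arithmetic assignments); the point of the benchmark is that the model must first select the right formalism before writing any program.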