Can LLMs Reason in the Wild with Programs?


19 Jun 2024 | Yuan Yang¹, Siheng Xiong¹, Ali Payani², Ehsan Shareghi³ & Faramarz Fekri¹
This paper introduces the task of reasoning in the wild, where an LLM is tasked to solve a reasoning problem of unknown type by identifying the sub-problems and their corresponding formalisms, and writing a program to solve each sub-problem, guided by a tactic. We create a large tactic-guided trajectory dataset, ReWild, containing detailed solutions to a diverse set of reasoning problems, ranging from well-defined single-form reasoning to ambiguous and hybrid ones. This allows us to test various aspects of LLMs' reasoning at a fine-grained level, such as the selection and execution of tactics and the tendency to take undesired shortcuts. In experiments, we highlight that existing LLMs fail significantly on problems with ambiguous and mixed scope, revealing critical limitations and overfitting issues. We further show that fine-tuning a local LLM on the trajectory data leads to better performance.

Concretely, reasoning in the wild involves solving reasoning problems by writing programs and interacting with an environment defined by a tactic. We demonstrate how to set up a unified reasoning framework that tackles both existing and new reasoning problems, and how to incorporate various mechanisms into the reasoning process that enable a more fine-grained analysis of a system's abilities. We also show how to evaluate such complex trajectories to gain deeper insight into LLMs' limitations and behavior beyond common holistic evaluation protocols. We evaluate a diverse set of the most powerful LLMs on our benchmark and find that they fail significantly on problems requiring tactic-guided programs, and that performance deteriorates further on hybrid problems. Through the lens of tactics, we analyze the results and identify three critical limitations of existing LLMs: (1) many LLMs show “overfitted” behavior and fail to follow the tactic on popular problems such as GSM8K, leading to a drop in performance; (2) most LLMs, except for the GPT4 series, lack the capability of instruction-following in long context, failing to follow the tactic on trajectories that are typically 3K tokens long; (3) powerful LLMs, including GPT4, tend to hallucinate and generate “trivial programs” on ambiguous reasoning problems, showing poor generalizability to out-of-distribution problems. Finally, we show that these limitations can be alleviated via fine-tuning: we train and release a LLaMA3-8B model on ReWild, which we refer to as TactIc-Guided ReasonER (TIGER-8B), and show that it achieves GPT4-level performance. We also evaluate LLMs on hybrid problems and find that they struggle with the difficult ones, indicating that hybrid problems remain a highly challenging and valuable benchmark that provides deep insight into LLMs' reasoning capability.
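To make the "reasoning with programs" setting concrete, the following is a minimal, hypothetical sketch of how a tactic might guide an LLM-written program for one sub-problem. The function and variable names (`arithmetic_tactic`, `facts`) are illustrative assumptions, not the paper's actual framework or API.

```python
# Hypothetical sketch of tactic-guided reasoning with a program.
# A "tactic" here prescribes how a sub-problem is formalized; the solver
# then emits a small program and executes it in an environment.

def arithmetic_tactic(facts):
    """Tactic (assumed): formalize an arithmetic word problem as
    straight-line assignments, then execute the generated program."""
    program = "\n".join(f"{name} = {expr}" for name, expr in facts)
    env = {}
    exec(program, {}, env)  # run the generated program, collect bindings
    return env

# GSM8K-style sub-problem: "A box holds 12 eggs. There are 3 boxes and
# 7 eggs have been eaten. How many eggs remain?"
facts = [
    ("eggs_per_box", "12"),
    ("boxes", "3"),
    ("eaten", "7"),
    ("remaining", "eggs_per_box * boxes - eaten"),
]
result = arithmetic_tactic(facts)
print(result["remaining"])  # 29
```

A deductive or constraint sub-problem would use a different tactic (e.g., emitting logic-programming clauses instead of arithmetic assignments); the point of the benchmark is that the model must first select the right formalism before writing any program.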