Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs

21 Jun 2024 | Siyuan Wang, Zhongyu Wei, Yejin Choi, Xiang Ren
This paper investigates the logical reasoning capabilities of large language models (LLMs) and proposes a logic scaffolding inferential rule generation framework to enhance their understanding of inferential rules. The authors construct ULogic, an inferential rule base with 8,000 primitive rules and 6,000 compositional rules across five domains: object affordance, accessibility, interaction, location, and person's need. They find that while LLMs like GPT-4 perform well on basic rules, they struggle with complex and compositional rules, exhibiting biases in their reasoning. To address these limitations, the authors distill ULogic into a smaller inference engine capable of generating accurate, complex, and abstract conclusions and premises. This engine is evaluated through a multi-judge mechanism, showing superior performance in various commonsense reasoning tasks compared to GPT-3.5-Turbo and even GPT-4. The study highlights the need for LLMs to improve their logical reasoning abilities, particularly in handling complex and compositional rules, and provides a practical approach to enhance their capabilities.
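To make the distinction between primitive and compositional rules concrete, the sketch below shows one way such rules might be represented and chained. This is an illustrative toy, not the paper's actual ULogic format: the `Rule` class, the predicate strings, and the `compose` function are all hypothetical, assuming a rule is a set of premises entailing a conclusion, and that composition chains one rule's conclusion into another's premises.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical rule representation (not the paper's actual schema):
# a rule is a tuple of premise predicates entailing one conclusion predicate.
@dataclass(frozen=True)
class Rule:
    premises: Tuple[str, ...]
    conclusion: str

def compose(r1: Rule, r2: Rule) -> Optional[Rule]:
    """Chain r1 into r2 when r1's conclusion discharges one of r2's premises,
    yielding a longer "compositional" rule from two primitive ones."""
    if r1.conclusion not in r2.premises:
        return None
    remaining = tuple(p for p in r2.premises if p != r1.conclusion)
    return Rule(premises=r1.premises + remaining, conclusion=r2.conclusion)

# Two primitive rules in the object-affordance flavor the paper describes:
cut = Rule(("IsSharp(x)", "IsTool(x)"), "CanCut(x, rope)")
free = Rule(("CanCut(x, rope)", "TiedWith(person, rope)"), "CanFree(x, person)")

composed = compose(cut, free)
print(composed)
# The composed rule now requires three premises to reach "CanFree(x, person)".
```

Compositional rules like `composed` are exactly where the paper reports LLMs degrading: each chaining step adds premises, so verifying the full rule requires tracking more intermediate conclusions than a primitive rule does.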