2024-5-29 | Xinyun Chen, Ryan A. Chi, Xuezhi Wang, Denny Zhou
The paper "Premise Order Matters in Reasoning with Large Language Models" by Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou investigates the impact of premise order on the reasoning performance of large language models (LLMs). The authors find that LLMs are significantly affected by the order of premises, even though the underlying task remains unchanged. Specifically, they observe that LLMs perform best when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting premises in the same order as the ground truth proof in the prompt significantly improves accuracy. The study uses a variety of LLMs, including GPT-4-turbo, GPT-3.5-turbo, PaLM 2-L, and Gemini 1.0 Pro, and evaluates their performance on logical reasoning tasks. The results show that the accuracy drop caused by different ordering can be over 30%. Additionally, the authors introduce the R-GSM benchmark, which examines the ordering effect on mathematical problem-solving. The benchmark is based on GSM8K and includes manually verified ground-truth answers for problems with different premise orders. The experiments on R-GSM confirm the significant impact of premise order, with all LLMs performing worse on reordered problems. The paper discusses potential reasons for this effect, such as the auto-regressive model design and the reasoning bias learned from training data, and suggests future work on developing techniques to mitigate the ordering effect.The paper "Premise Order Matters in Reasoning with Large Language Models" by Xinyun Chen, Ryan A. Chi, Xuezhi Wang, and Denny Zhou investigates the impact of premise order on the reasoning performance of large language models (LLMs). The authors find that LLMs are significantly affected by the order of premises, even though the underlying task remains unchanged. Specifically, they observe that LLMs perform best when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting premises in the same order as the ground truth proof in the prompt significantly improves accuracy. The study uses a variety of LLMs, including GPT-4-turbo, GPT-3.5-turbo, PaLM 2-L, and Gemini 1.0 Pro, and evaluates their performance on logical reasoning tasks. The results show that the accuracy drop caused by different ordering can be over 30%. Additionally, the authors introduce the R-GSM benchmark, which examines the ordering effect on mathematical problem-solving. The benchmark is based on GSM8K and includes manually verified ground-truth answers for problems with different premise orders. The experiments on R-GSM confirm the significant impact of premise order, with all LLMs performing worse on reordered problems. The paper discusses potential reasons for this effect, such as the auto-regressive model design and the reasoning bias learned from training data, and suggests future work on developing techniques to mitigate the ordering effect.