2024-5-29 | Xinyun Chen, Ryan A. Chi, Xuezhi Wang and Denny Zhou
Large language models (LLMs) show significant performance differences depending on the order in which premises are presented in reasoning tasks, even when the underlying task is unchanged. This study finds that LLMs perform best when the premises are ordered to match the logical steps needed to reach the conclusion. In deductive reasoning, for example, presenting premises in the same order as the ground-truth proof improves accuracy, while random or reversed orderings cause significant performance drops. The study evaluates this effect across several LLMs, including GPT-4-turbo, GPT-3.5-turbo, PaLM 2-L, and Gemini 1.0 Pro, and finds that accuracy can fall by more than 30% when the premise order is altered. The effect is further amplified when irrelevant premises are included in the prompt.
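To make the experimental setup concrete, the sketch below builds forward, reversed, and randomly shuffled orderings of the same toy deductive problem. The premises, the conclusion, and the `build_prompt` helper are invented for illustration; they are not taken from the paper's benchmark, and the resulting prompts would still need to be sent to a model and scored.

```python
import random

# A toy deductive problem in the style described above. The forward ordering
# matches the order in which the premises are used in the ground-truth proof.
premises = [
    "Alice is a doctor.",
    "If Alice is a doctor, then Alice works at a hospital.",
    "If Alice works at a hospital, then Alice lives in the city.",
    "If Alice lives in the city, then Alice takes the subway.",
]
conclusion = "Does Alice take the subway?"

def build_prompt(premise_list, question):
    """Assemble a single prompt from an ordered list of premises plus the question."""
    body = "\n".join(f"- {p}" for p in premise_list)
    return f"Premises:\n{body}\n\nQuestion: {question}\nAnswer and give a proof."

# Three orderings of the same premises: forward (proof order), reversed,
# and a random shuffle. The underlying task is identical in all three.
orderings = {
    "forward": premises,
    "reversed": list(reversed(premises)),
    "shuffled": random.sample(premises, k=len(premises)),
}

for name, order in orderings.items():
    print(f"=== {name} ===")
    print(build_prompt(order, conclusion))
    print()
```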
The study also introduces R-GSM, a benchmark built from GSM8K, to examine the impact of premise order on mathematical reasoning. The results show that LLMs perform significantly worse on reordered problems, even though the correct answer is unchanged. Overall, LLMs are most comfortable with a left-to-right (forward) premise ordering, which aligns with human reasoning preferences; however, models differ in their sensitivity to premise order, with some performing relatively well under backward chaining and others struggling with random or reversed orders.
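As a rough illustration of the R-GSM construction, the sketch below permutes the statements of a GSM8K-style word problem while keeping the question, and hence the answer, fixed. The example problem, its numbers, and the `render` helper are hypothetical and are not actual R-GSM items; they only show the kind of reordering the benchmark applies.

```python
# Minimal sketch of the reordering idea behind R-GSM: keep the final question
# fixed and permute the descriptive statements so the answer is unchanged but
# the text no longer presents facts in the order a forward solution uses them.
# Explicit time markers (morning, noon, afternoon, evening) keep the timeline,
# and therefore the answer, intact under any text ordering.

statements = [
    "A baker makes 48 muffins in the morning.",
    "She sells half of her muffins before noon.",
    "In the afternoon she bakes 12 more muffins.",
    "In the evening she gives 6 muffins to a neighbor.",
]
question = "How many muffins does she have left?"

def render(stmts, q):
    """Join the statements and the question into a single problem text."""
    return " ".join(stmts) + " " + q

original = render(statements, question)

# One possible reordering: same facts, same answer (48/2 + 12 - 6 = 30),
# but the statements no longer appear in the order the computation needs them.
reordered = render(
    [statements[2], statements[0], statements[3], statements[1]],
    question,
)

print(original)
print(reordered)
```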
The findings suggest that the auto-regressive nature of LLMs and the reasoning biases learned during training contribute to this sensitivity to premise order. While humans are relatively insensitive to premise order in simple reasoning tasks, LLMs show a clear preference for specific orderings. The study concludes that future work should develop new training and modeling techniques to mitigate the impact of premise order on LLM reasoning.