16 Jun 2024 | Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor, Dan Roth
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. The authors develop controlled synthetic datasets of conjunction-fallacy and syllogistic problems to evaluate LLM performance. The framework specifies a series of hypotheses under which token biases are readily identifiable, with every null hypothesis assuming that the LLM reasons genuinely. The findings suggest, with statistical guarantees, that most LLMs still struggle with logical reasoning and rely heavily on recognizing superficial patterns, raising concerns about their actual reasoning and generalization abilities.
The study highlights the need for rigorous statistical testing to distinguish genuine reasoning from token bias when evaluating LLMs. The results indicate that token bias, rather than genuine advances in reasoning capability, accounts for much of the measured improvement on reasoning tasks. The authors conclude that LLMs may not engage in true reasoning but instead rely on semantic shortcuts and superficial patterns, and they call for further investigation into the underlying mechanisms and limits of LLM reasoning.
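To make the framework concrete, here is a minimal sketch of how such a null hypothesis could be tested. The idea: evaluate a model on paired problems that are logically identical but differ in surface tokens; under the null of genuine reasoning, the model should be equally likely to flip correct-to-wrong as wrong-to-correct across a pair. An exact McNemar-style test on the discordant pairs then yields a p-value. This is an illustrative instantiation with toy numbers, not the paper's exact procedure or data.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar test on discordant pairs of problems.

    b: problems answered correctly in the original wording but
       incorrectly after a token perturbation.
    c: the reverse direction.
    Under the null hypothesis (genuine reasoning, so surface tokens
    should not matter), each discordant pair is b-type with
    probability 0.5, giving an exact two-sided binomial p-value.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing to reject
    k = max(b, c)
    p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(p, 1.0)

# Toy counts (hypothetical, not from the paper): out of 100 paired
# problems, the model flips correct -> wrong on 18 after the
# perturbation, and wrong -> correct on only 4.
p_value = mcnemar_exact(18, 4)
print(f"p = {p_value:.4f}")  # a small p rejects the genuine-reasoning null
```

A small p-value here would support the paper's conclusion: the model's answers depend on surface tokens, not just the underlying logic, so the "genuine reasoning" null is rejected with a statistical guarantee on the false-rejection rate.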