The paper "Can LLM Graph Reasoning Generalize beyond Pattern Memorization?" by Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xiaochuang Han, Tianxing He, and Yulia Tsvetkov explores the generalization capabilities of large language models (LLMs) in graph reasoning tasks. The authors propose the NLGIFT benchmark, a comprehensive evaluation suite designed to assess whether LLMs trained on synthetic graph data generalize beyond memorizing surface patterns to real-world graph-based tasks. The benchmark covers five types of patterns: semantic, numerical, structural, reasoning, and real-world, with 37,000 problems in total.
Experiments with two LLMs, CHATGPT and LLAMA2-7B, across four graph reasoning tasks show that while LLMs achieve significant transfer on simple patterns (semantic, numerical, and structural), they struggle to generalize to more complex reasoning and real-world patterns. Specifically, LLMs achieve strong performance recovery on reasoning patterns only 33% of the time, and they fail to generalize at all on real-world patterns. The authors explore three strategies to improve generalization: code mixing, machine-generated chains of thought (CoTs), and post-training alignment. Post-training alignment proves the most promising, but the challenge of enabling LLMs to go beyond pattern memorization remains open.
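The pattern-shift evaluation described above can be illustrated with a small sketch (the function names and node vocabularies here are hypothetical, not from the paper): a synthetic connectivity question is posed over one vocabulary of node labels, then relabeled with a different vocabulary. Since the graph structure is unchanged, the ground-truth answer is invariant, so any change in a model's answer indicates reliance on surface semantics rather than structure.

```python
from collections import deque

def connected(edges, a, b):
    """Ground-truth check: BFS over an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def relabel(edges, mapping):
    """Swap node labels (e.g. letters -> city names) without changing
    structure: a 'semantic pattern' shift in the benchmark's sense."""
    return [(mapping[u], mapping[v]) for u, v in edges]

edges = [("A", "B"), ("B", "C"), ("D", "E")]
mapping = {"A": "Paris", "B": "Lyon", "C": "Nice", "D": "Oslo", "E": "Bergen"}
shifted = relabel(edges, mapping)

# Ground truth is invariant under relabeling; a robust model's answers
# should be too.
assert connected(edges, "A", "C") == connected(shifted, "Paris", "Nice")
assert connected(edges, "A", "D") == connected(shifted, "Paris", "Oslo")
```

Numerical and structural shifts can be probed the same way, by rescaling edge weights or resampling graph topology while keeping the task definition fixed.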
The paper concludes that while LLMs show some robustness to changes in semantic and numerical attributes, they face significant challenges in generalizing to real-world tasks involving networks and structures. The findings highlight the need for further research to enhance LLMs' ability to reason about graphs in a more general and transferable manner.