Can LLM Graph Reasoning Generalize beyond Pattern Memorization?


23 Jun 2024 | Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xiaochuang Han, Tianxing He, Yulia Tsvetkov
This paper investigates whether large language models (LLMs) can generalize graph reasoning beyond memorizing patterns in synthetic training data. The authors propose the NLGIFT benchmark, which evaluates LLMs on five types of graph reasoning patterns: semantic, numerical, structural, reasoning, and real-world. The benchmark includes 37,000 problems; LLMs are trained on a subset and then tested on both in-distribution and out-of-distribution data. The results show that while LLMs generalize on simple patterns such as semantic and numerical shifts, they struggle with the more complex reasoning and real-world patterns, suggesting that synthetic graph tuning may not transfer to real-world tasks with underlying network structures.

The authors explore three strategies to improve LLM graph reasoning generalization: code mixing, machine-generated chains of thought, and post-training alignment. While post-training alignment shows promise for real-world tasks, empowering LLMs to go beyond pattern memorization remains an open research question. The paper also examines the impact of training data, finding that keyword frequency in the training corpus significantly affects performance, and highlights the importance of evaluating LLMs across different graph structures and sizes.

The authors conclude that LLMs are not robust graph reasoners but mostly pattern regurgitators. They offer the NLGIFT benchmark as a tool for evaluating LLM graph reasoning generalization across various patterns and tasks, and emphasize the need for further research to improve LLM graph reasoning capabilities and address the challenges of generalization.
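To make the evaluation setup concrete, the sketch below generates a synthetic graph-reasoning instance of the kind such a benchmark might contain: a connectivity question rendered in natural language, with the ground-truth answer computed by BFS. The generator and prompt wording here are illustrative assumptions, not the actual NLGIFT construction; a structural distribution shift can then be simulated by testing on graphs larger than those used for tuning.

```python
import random
from collections import deque

def random_graph(n_nodes, n_edges, seed=0):
    """Sample an undirected graph as a sorted edge list (illustrative generator)."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)
        edges.add((min(u, v), max(u, v)))
    return sorted(edges)

def connected(edges, src, dst):
    """Ground-truth answer for the connectivity question, via BFS."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def to_question(edges, src, dst):
    """Render the instance as a natural-language prompt for an LLM."""
    edge_text = ", ".join(f"({u}, {v})" for u, v in edges)
    return (f"In an undirected graph with edges {edge_text}, "
            f"is there a path from node {src} to node {dst}? Answer yes or no.")

# An in-distribution instance (small graph) and a structurally shifted one (larger graph):
small = random_graph(5, 4, seed=1)
large = random_graph(20, 25, seed=2)
print(to_question(small, 0, 4), "->", connected(small, 0, 4))
print("OOD instance has", len(large), "edges")
```

Comparing model accuracy on the small-graph instances against the large-graph instances is one simple way to probe the structural pattern shift the paper describes.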