18 Jun 2019 | Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, Stephan Günnemann
Graph neural networks (GNNs) have achieved significant success in semi-supervised node classification, but the way GNN models are evaluated is fraught with pitfalls. This paper highlights the limitations of common evaluation strategies, showing that reusing one fixed train/validation/test split while training different models with inconsistent training procedures and hyperparameter selection strategies leads to unfair and fragile comparisons. The authors evaluate four prominent GNN models (GCN, MoNet, GAT, and GraphSAGE) on four well-known citation networks and four new datasets. They find that different data splits can lead to dramatically different rankings of models, and that simpler GNN architectures can outperform more complex ones when hyperparameters and training procedures are tuned fairly for all models.
The study emphasizes the importance of standardized training and hyperparameter selection procedures for fair comparisons. The authors also introduce four new datasets for node classification and adopt a standardized evaluation setup with 100 random train/validation/test splits and 20 random weight initializations per split, which gives a more reliable estimate of generalization performance.
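To make this setup concrete, below is a minimal sketch of the multi-split evaluation protocol (it is not the authors' code). Synthetic node features and an off-the-shelf MLP classifier are stand-ins for the real datasets and GNN models; the per-class split sizes (20 training and 30 validation nodes per class) follow the setup described in the paper, and only the overall protocol structure is the point.

# Sketch of the multi-split evaluation protocol: 100 random splits x 20 random
# weight initializations. Synthetic data and an MLP stand in for real graphs/GNNs.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_nodes, n_features, n_classes = 1000, 50, 5
X = rng.normal(size=(n_nodes, n_features))    # stand-in for node features
y = rng.integers(0, n_classes, size=n_nodes)  # stand-in for node labels

def random_split(y, rng, train_per_class=20, val_per_class=30):
    """Sample a train/validation/test split with a fixed number of nodes per class."""
    train, val, test = [], [], []
    for c in np.unique(y):
        nodes = rng.permutation(np.where(y == c)[0])
        train.extend(nodes[:train_per_class])
        val.extend(nodes[train_per_class:train_per_class + val_per_class])
        test.extend(nodes[train_per_class + val_per_class:])
    return np.array(train), np.array(val), np.array(test)

accuracies = []
for split in range(100):      # 100 random train/validation/test splits
    train_idx, val_idx, test_idx = random_split(y, rng)
    for init in range(20):    # 20 random weight initializations per split
        # val_idx would be used for early stopping / hyperparameter selection.
        clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=init)
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"mean test accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")

Reporting the mean and standard deviation over all split/initialization pairs, rather than a single number from one split, is what makes the comparison robust.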
Results show that GNN-based approaches significantly outperform baselines like MLP and LogReg. However, no single GNN model consistently outperforms others across all datasets. GCN achieved the best performance overall, suggesting that simpler models can be effective when properly tuned. The study also highlights the fragility of results based on a single train/validation/test split, as different splits can lead to completely different rankings of models. This underscores the need for evaluation strategies based on multiple splits to ensure robustness. The authors conclude that future research should focus on more robust evaluation procedures to better understand the strengths and limitations of different GNN models.
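As a small illustration of why multiple splits matter, the snippet below compares two hypothetical models across 100 splits and reports how often their ranking would differ from the aggregate conclusion if only one split were used. The accuracy values are synthetic placeholders, not results from the paper.

# Illustration only: synthetic per-split test accuracies for two hypothetical models,
# standing in for numbers produced by the protocol sketched earlier.
import numpy as np

rng = np.random.default_rng(1)
acc_a = rng.normal(0.80, 0.02, size=100)  # per-split test accuracy, model A
acc_b = rng.normal(0.79, 0.03, size=100)  # per-split test accuracy, model B

wins_a = np.mean(acc_a > acc_b)
print(f"model A ranks first on {wins_a:.0%} of the 100 splits")
print(f"A: {acc_a.mean():.3f} +/- {acc_a.std():.3f}   B: {acc_b.mean():.3f} +/- {acc_b.std():.3f}")

If model A wins on, say, only 65% of splits, a paper that evaluates on a single split has a substantial chance of reporting the opposite ranking, which is exactly the fragility the authors warn about.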