20 Nov 2019 | Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay
This paper evaluates the performance of learned molecular representations for property prediction, comparing neural networks applied to computed molecular fingerprints or expert-crafted descriptors with graph convolutional neural networks (GCNs). The authors benchmark their models on 19 public and 16 proprietary industrial datasets, introducing a GCN that consistently matches or outperforms models using fixed molecular descriptors and previous graph neural architectures. They find that while learned representations have not yet reached the level of experimental reproducibility, their proposed model offers significant improvements over existing industrial models. The study also explores the impact of molecular representation on dataset characteristics, noting that a hybrid representation combining convolutions and descriptors yields higher performance and better generalization. Hyperparameter selection and ensemble techniques are shown to further enhance model accuracy. The results indicate that learned molecular representations are applicable and ready for use in drug discovery workflows.
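The two ideas named in the summary, a hybrid representation and ensembling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`hybrid_representation`, `linear_head`, `ensemble_predict`) are hypothetical, the "learned" vector stands in for the output of a graph convolution, and a plain linear readout stands in for the feed-forward predictor. It assumes only that the hybrid representation is formed by concatenating the learned vector with fixed descriptors, and that an ensemble averages the predictions of independently trained models.

```python
from statistics import fmean


def hybrid_representation(learned, descriptors):
    """Concatenate a learned graph embedding with fixed molecular descriptors.

    Both arguments are plain sequences of floats (illustrative assumption).
    """
    return list(learned) + list(descriptors)


def linear_head(features, weights, bias=0.0):
    """Toy linear readout standing in for a trained feed-forward predictor."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


def ensemble_predict(features, heads):
    """Average the predictions of several independently trained heads.

    `heads` is a sequence of (weights, bias) pairs, one per ensemble member.
    """
    return fmean(linear_head(features, w, b) for w, b in heads)


# Usage: a 2-dimensional learned embedding plus 1 descriptor, 3 ensemble members.
features = hybrid_representation([1.0, 2.0], [3.0])
heads = [([1.0, 0.0, 0.0], 0.0), ([0.0, 1.0, 0.0], 0.0), ([0.0, 0.0, 1.0], 0.0)]
prediction = ensemble_predict(features, heads)
```

In practice the descriptor vector would come from a cheminformatics toolkit and each ensemble member would be a separately trained network; the sketch only shows how the pieces fit together.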