Understanding MoleculeNet: a benchmark for molecular machine learning

MoleculeNet is a benchmark for molecular machine learning. It curates multiple public datasets, establishes evaluation metrics, and provides high-quality open-source implementations of molecular featurization and learning algorithms. It is built on the open-source DeepChem package and covers over 700,000 compounds in four categories: quantum mechanics, physical chemistry, biophysics, and physiology. The datasets include QM7, QM8, QM9, ESOL, FreeSolv, Lipophilicity, PCBA, MUV, HIV, PDBbind, BACE, BBBP, Tox21, ToxCast, and SIDER.

Each dataset is split into training, validation, and test sets using one of several methods: random, scaffold, stratified, or time splitting. Regression tasks are evaluated with mean absolute error (MAE) or root-mean-square error (RMSE), and classification tasks with the area under the receiver operating characteristic curve (ROC-AUC) or under the precision-recall curve (PRC-AUC).

MoleculeNet provides implementations of several molecular featurization methods, including ECFP, Coulomb matrix, grid featurizer, symmetry functions, and graph convolutions. The learning models include logistic regression, support vector classification, kernel ridge regression, random forests, gradient boosting, multitask networks, bypass networks, and influence relevance voting. Graph-based models include graph convolutional models, weave models, directed acyclic graph models, deep tensor neural networks, ANI-1, and message passing neural networks.

The benchmarks show that learnable representations are powerful and generally perform best for molecular machine learning, but their effectiveness varies with the dataset and task: they struggle with data scarcity and imbalanced classification. Physics-aware featurizations are particularly important for quantum mechanical and biophysical datasets.
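To make the scaffold splitting mentioned above more concrete, here is a minimal sketch in plain Python. It assumes each molecule already has a scaffold string (for example, one computed with a cheminformatics toolkit), and the function name `scaffold_split`, the largest-group-first ordering, and the 80/10/10 default fractions are illustrative assumptions rather than the benchmark's own code. The point is only that molecules sharing a scaffold never end up in different subsets.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold spans two subsets.

    `scaffolds` is a list with one (precomputed, hypothetical) scaffold
    string per molecule. Returns three lists of indices.
    """
    # Group molecule indices by their scaffold string.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Assumption: place the largest scaffold groups first, so rare
    # scaffolds tend to land in the validation and test sets.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because whole groups are moved at once, the resulting subset sizes only approximate the requested fractions. That is the price of keeping structurally related molecules together, which makes the test set a harder and arguably more realistic check of generalization than a random split.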
MoleculeNet provides benchmark results for various tasks and datasets, and its open-source implementation allows for further development and comparison of machine learning methods in molecular science.
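The evaluation metrics named in the summary are standard, and a small sketch may help readers who want to see what they compute. The functions below are written from the textbook definitions in plain Python, not taken from the MoleculeNet or DeepChem code. The function names are illustrative, and ROC-AUC is computed through its pairwise-ranking interpretation, which is simple but quadratic in the number of samples.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error for a regression task."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error for a regression task."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def roc_auc(labels, scores):
    """ROC-AUC for binary labels (0 or 1).

    Equals the probability that a randomly chosen positive is scored
    above a randomly chosen negative, counting ties as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))           # 2/3
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))    # 0.75
```

ROC-AUC depends only on how the scores rank positives against negatives, not on the raw proportion of each class. This is one reason a precision-recall metric can be more informative when positives are very rare, which ties in with the imbalanced-classification difficulty noted in the summary.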
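As an illustration of what a physics-aware featurization looks like, the following sketch builds a Coulomb matrix from nuclear charges and 3D coordinates using its usual published definition: diagonal entries are 0.5 * Z_i^2.4 and off-diagonal entries are Z_i * Z_j / |R_i - R_j|. The function name, the use of atomic units, and the absence of padding or atom reordering are assumptions made for this example, so it should not be read as the DeepChem implementation.

```python
import math


def coulomb_matrix(charges, coords):
    """Return the Coulomb matrix as a nested list.

    charges: nuclear charges Z_i, one per atom.
    coords:  (x, y, z) positions, assumed here to be in atomic units.
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: a fit to the energy of the free atom.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: Coulomb repulsion between nuclei.
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical H2 molecule with a bond length of 1.4 atomic units.
h2 = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

The matrix encodes both composition and geometry, which is why this kind of input is a natural fit for the quantum mechanical datasets. It is invariant to translation and rotation of the molecule, but not to the order in which atoms are listed, so practical pipelines usually add a sorting or randomization step.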

MoleculeNet: A Benchmark for Molecular Machine Learning

26 Oct 2018 | Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande