2024 | Hajime Shimakawa, Akiko Kumada, and Masahiro Sato
This paper presents a comprehensive benchmark to assess the extrapolative performance of machine learning (ML) models in predicting molecular properties, particularly focusing on small-scale experimental datasets. The benchmark evaluates 12 organic molecular properties across various datasets, revealing significant performance degradation in conventional ML models beyond the training distribution. To address this challenge, the authors introduce a quantum-mechanical (QM) descriptor dataset called QMex and an interactive linear regression (ILR) model that incorporates interaction terms between QM descriptors and categorical information about molecular structures. The QMex-based ILR model achieves state-of-the-art extrapolative performance while maintaining interpretability. The study highlights the importance of QM descriptors and linear models in overcoming the limitations of property range and molecular structure within the training data. The proposed model and QMex descriptors are expected to be valuable tools for improving extrapolative predictions with limited experimental data and discovering novel materials or molecules.
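To make the idea of interaction terms between descriptors and categorical structure information more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain linear model whose design matrix holds the continuous descriptors, binary category indicators, and their pairwise products, and it evaluates the fit on a property-range extrapolation split (train on the lowest property values, test on the highest). The function names, the synthetic data, and the small ridge penalty are all hypothetical choices made for illustration; the actual ILR formulation and the QMex descriptors may differ.

```python
import numpy as np


def interaction_features(X, C):
    """Build a design matrix [1, X, C, X*C] (hypothetical helper).

    X: (n, d) continuous descriptors (stand-ins for QM descriptors).
    C: (n, k) binary indicators of structural categories.
    """
    n = X.shape[0]
    # Pairwise products of every descriptor with every category indicator.
    inter = (X[:, :, None] * C[:, None, :]).reshape(n, -1)
    return np.hstack([np.ones((n, 1)), X, C, inter])


def fit_ridge(A, y, alpha=1e-8):
    """Closed-form ridge regression; alpha is an illustrative choice."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(d), A.T @ y)


def extrapolation_split(y, train_fraction=0.8):
    """Train on the lowest property values, test on the highest."""
    order = np.argsort(y)
    cut = int(len(y) * train_fraction)
    return order[:cut], order[cut:]


# Synthetic, noise-free data (assumption): the slope of descriptor 0
# depends on the structural category, which is what an interaction
# term is meant to capture.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
C = rng.integers(0, 2, size=(n, 1)).astype(float)
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.7 * C[:, 0] + 1.5 * X[:, 0] * C[:, 0]

train, test = extrapolation_split(y)
A = interaction_features(X, C)
w = fit_ridge(A[train], y[train])

pred = A[test] @ w
rmse = float(np.sqrt(np.mean((pred - y[test]) ** 2)))
print("coefficients:", np.round(w, 3))
print("extrapolation RMSE:", rmse)
```

Because the model stays linear in its features, each fitted coefficient can be read directly as a baseline descriptor effect or a category-specific correction to it, which is the kind of interpretability the summary attributes to ILR. On this idealized noise-free data the linear form also carries over to the unseen high-value range; real experimental data would not behave this cleanly.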