February 1, 2024 | Prudencio Tossou, Cas Wognum, Michael Craig, Hadrien Mary, and Emmanuel Noutahi
This study presents a rigorous framework for investigating molecular out-of-distribution (MOOD) generalization in drug discovery. The concept of MOOD is clarified through a problem specification that demonstrates how covariate shifts during real-world deployment can be characterized by the distribution of sample distances to the training set. These shifts can cause predictive performance to drop by up to 60% and uncertainty calibration to degrade by up to 40%. To address this, a splitting protocol is proposed to close the gap between deployment and testing. A thorough investigation is conducted to assess the impact of model design, model selection, and dataset characteristics on MOOD performance and uncertainty calibration. The study finds that appropriate representations and algorithms with built-in uncertainty estimation are crucial for improving performance and uncertainty calibration.
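The distance-based characterization above can be sketched as follows. This is a minimal illustration, not the study's implementation: fingerprints are modeled as plain sets of hashed features, whereas a real pipeline would use e.g. ECFP fingerprints from a cheminformatics toolkit.

```python
# Sketch: characterize a covariate shift by each query's nearest-neighbor
# distance to the training set. Set-based "fingerprints" are illustrative
# stand-ins for real molecular fingerprints.

def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty fingerprints count as identical."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)

def distances_to_train(train_fps, query_fps):
    """Nearest-neighbor distance of each query to the training set."""
    return [min(tanimoto_distance(q, t) for t in train_fps) for q in query_fps]

train = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
queries = [frozenset({1, 2, 3}), frozenset({7, 8, 9})]
print(distances_to_train(train, queries))  # [0.0, 1.0]
```

The histogram of these per-query distances is what shifts under deployment: the farther the mass moves from zero, the stronger the covariate shift.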
The analysis of a model's applicability domain (AD) and its ML equivalent, out-of-distribution (OOD) detection, addresses generalizability by assessing queries and defining prediction reliability. However, these methods only approximate model robustness with respect to a portion of the chemical space and do not directly improve generalization. Generalization can typically be improved through data augmentation or collecting more data points, but these approaches are limited or costly. Therefore, the current literature focuses on alternatives such as model selection and design.
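One simple way an applicability domain can be operationalized is sketched below, assuming a distance-based AD: calibrate a cutoff from within-training nearest-neighbor distances and flag queries beyond it as unreliable. The 95th-percentile choice and the numbers are illustrative assumptions, not values from the study.

```python
# Sketch of a distance-based applicability-domain (AD) check: predictions for
# queries farther from the training set than a calibrated threshold are
# flagged as unreliable. Percentile choice is an illustrative assumption.

def ad_threshold(train_nn_distances, percentile=0.95):
    """Distance below which most training samples sit relative to each other."""
    ranked = sorted(train_nn_distances)
    idx = int(percentile * (len(ranked) - 1))
    return ranked[idx]

def reliable(query_nn_distance, threshold):
    """A prediction is considered reliable only inside the AD."""
    return query_nn_distance <= threshold

dists = [0.10, 0.12, 0.13, 0.15, 0.16, 0.18, 0.20, 0.22, 0.25, 0.30, 0.80]
t = ad_threshold(dists)  # 0.30: the single far-off sample is excluded
print([reliable(d, t) for d in [0.10, 0.28, 0.90]])  # [True, True, False]
```

As the section notes, such a check only labels queries as in- or out-of-domain; it does nothing to make the model generalize better outside the threshold.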
Model selection aims to choose the model with the best AD from a pool of candidates. A key factor in model selection is the splitting strategy that divides the available data into training, validation, and testing sets. In drug discovery, AD-oriented alternatives such as temporal and scaffold-based splits are increasingly popular. However, model selection may fail to produce a model with a better AD if none of the candidates can generalize, which is expected when the training algorithm assumes all data partitions are independently and identically distributed (IID).
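A scaffold-style split of the kind mentioned above can be sketched as a grouped split: molecules sharing a scaffold key always land in the same partition, so every test-set scaffold is unseen during training. The `key` function here is a stand-in; real code would compute Bemis-Murcko scaffolds with a toolkit such as RDKit.

```python
# Sketch of a scaffold-style (grouped) split: whole groups are assigned to one
# partition, so no scaffold appears in both training and testing.
import random
from collections import defaultdict

def grouped_split(items, key, test_frac=0.2, seed=0):
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)  # deterministic group order
    n_test = max(1, int(test_frac * len(items)))
    train, test = [], []
    for k in keys:
        # Fill the test set first, one whole group at a time.
        (test if len(test) < n_test else train).extend(groups[k])
    return train, test

# Hypothetical (molecule_id, scaffold_key) pairs.
mols = [("m1", "A"), ("m2", "A"), ("m3", "B"), ("m4", "C"), ("m5", "C")]
train, test = grouped_split(mols, key=lambda m: m[1], test_frac=0.4)
```

Because groups are never split, the partitions share no scaffold key, which is what makes such splits harder, and more deployment-like, than random IID splits.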
To improve the model AD, one can modify the molecular representation, modeling hypothesis, learning paradigm, loss function, or regularizer. Previous studies in domain generalization (DG) and unsupervised domain adaptation (UDA) have focused on relaxing the IID hypothesis to expand the AD. In DG, a model is trained with additional labels specifying each sample's domain. In UDA, the model is trained with unlabeled molecules from a novel chemical space. Both approaches aim to learn domain-invariant scoring mechanisms that enable generalization to domains unseen during training. However, current definitions of domains in DG, choices of unlabeled sets in UDA, and the DG and UDA learning algorithms themselves have yet to produce better molecular generalization.
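To make the DG idea concrete, here is one representative way domain labels can enter the objective, sketched in the spirit of risk-extrapolation methods rather than any algorithm from the study: penalize the variance of per-domain risks so the model cannot lean on domain-specific shortcuts. The risk values and penalty weight are illustrative.

```python
# Sketch of a DG-style objective (risk-extrapolation flavor, illustrative):
# mean per-domain risk plus a penalty on how unevenly the domains are fit.

def dg_objective(per_domain_risks, penalty_weight=1.0):
    """Mean risk across domains plus the variance of per-domain risks."""
    n = len(per_domain_risks)
    mean = sum(per_domain_risks) / n
    variance = sum((r - mean) ** 2 for r in per_domain_risks) / n
    return mean + penalty_weight * variance

# Equal risks across domains incur no penalty; uneven risks are discouraged.
print(dg_objective([1.0, 1.0, 1.0]))  # 1.0
print(dg_objective([0.0, 2.0, 1.0]) > dg_objective([1.0, 1.0, 1.0]))  # True
```

The section's finding is precisely that objectives of this family, and UDA analogues, have not yet outperformed plain IID training on molecular scoring.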
The study finds that UDA and DG methods for molecular scoring tend to perform similarly to or worse than IID methods. The attempts to use AD and OOD detection concepts in molecular scoring and the failure of model design and model selection to improve the model AD demonstrate the importance and complexity of MOOD generalization. The complexity stems from poor problem specification and a lack of consensus regarding the evaluation of ML-based molecular scorers.
The study proposes a specification of real-world distribution shifts encountered during deployment, characterizing them by the distribution of each query's distance to the training set.