2024 | Prudencio Tossou, Cas Wognum, Michael Craig, Hadrien Mary, Emmanuel Noutahi
This study presents a rigorous framework for investigating molecular out-of-distribution (MOOD) generalization in drug discovery. The authors clarify the concept of MOOD through a problem specification, showing that the covariate shifts encountered during real-world deployment can be characterized by the distribution of each sample's distance to the training set. These shifts can cause performance drops of up to 60% and degrade uncertainty calibration by up to 40%. To close the gap between deployment and testing, the authors propose a distance-aware splitting protocol and use it to conduct a thorough investigation of how model design, model selection, and dataset characteristics affect MOOD performance and uncertainty calibration.

They find that appropriate molecular representations and algorithms with built-in uncertainty estimation are crucial for improving performance and calibration, and that best out-of-distribution practices from other machine learning disciplines transfer poorly to molecular modeling, underscoring the need for a dedicated MOOD modeling paradigm. Model design choices prove more influential for MOOD generalization than model selection tools: molecular representations have the largest impact on performance, while algorithms with built-in uncertainty estimation yield better-calibrated uncertainties. The study also shows that deep molecular representations do not yet outperform other representations but leave room for improvement, and that out-of-distribution algorithms require further development before they surpass in-distribution algorithms in both uncertainty calibration and generalization.
Overall, the study opens an exciting avenue for benchmarking meaningful algorithmic progress in molecular scoring.
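To make the core idea concrete, below is a minimal sketch (not code from the paper) of how one might characterize a covariate shift by the distribution of each deployment sample's distance to the training set, using nearest-neighbor Tanimoto distance over Morgan fingerprints via RDKit. The function names and the tiny SMILES lists are illustrative assumptions, not artifacts of the study.

    # Minimal sketch, assuming Tanimoto distance over Morgan fingerprints as the
    # distance measure; the paper's actual protocol may differ in detail.
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def fingerprints(smiles_list, radius=2, n_bits=2048):
        """Morgan bit-vector fingerprints for a list of SMILES (invalid entries skipped)."""
        fps = []
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            if mol is not None:
                fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))
        return fps

    def distances_to_training_set(deploy_smiles, train_smiles):
        """For each deployment molecule, Tanimoto distance to its nearest training molecule."""
        train_fps = fingerprints(train_smiles)
        dists = []
        for fp in fingerprints(deploy_smiles):
            sims = DataStructs.BulkTanimotoSimilarity(fp, train_fps)
            dists.append(1.0 - max(sims))  # distance = 1 - max similarity
        return dists

    # Hypothetical toy data: the histogram of `dists` approximates the deployment
    # distance distribution; a distance-aware split would then select test molecules
    # whose distances to the training set match this distribution.
    train = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
    deploy = ["CCN", "c1ccncc1"]
    print(distances_to_training_set(deploy, train))

The design intent is that a test set drawn to match this deployment distance distribution gives a more faithful estimate of real-world MOOD performance than a random or purely scaffold-based split.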