MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery

10 May 2024 | Till Siebenmorgen, Filipe Menezes, Sabrina Benassou, Erinc Merdivan, Kieran Didi, André Santos Dias Mourão, Radosław Kitel, Pietro Liò, Stefan Kesselheim, Marie Piraud, Fabian J. Theis, Michael Sattler & Grzegorz M. Popowicz
MISATO is a machine learning (ML) dataset of protein–ligand complexes for structure-based drug discovery. It combines quantum mechanical (QM) properties of small molecules with molecular dynamics (MD) simulations of ~20,000 experimental protein–ligand complexes, and provides extensive validation of the experimental data. Starting from the experimental structures, semi-empirical quantum mechanics was used to curate and refine them systematically, including regularization of the ligand geometry. The dataset also includes a large collection of MD traces of protein–ligand complexes in explicit water, accumulating over 170 μs.

MISATO augments the experimental structures with dynamic and chemical information that they lack. The MD simulations reach a timescale that allows transient and cryptic states to be detected for certain systems, and such states are very important for successful drug design. By supplementing the experimental data with as many physical parameters as possible, the database spares AI models from having to learn this information implicitly, so they can focus on the main learning task.

The database is provided in a user-friendly format that can be imported directly into ML code, giving ML experts an easy entry point for building the next generation of drug discovery AI models. Preprocessing scripts are provided to filter and visualize the dataset. Example AI baseline models are supplied for calculating quantum chemical properties, predicting binding affinity, and predicting protein flexibility or induced-fit features, and these baselines show improved accuracy when the data are used. The QM calculations, MD trajectories, and AI baseline models are extensively validated against experimental data, showing high correlations between predicted and actual values.
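To make "correlation between predicted and actual values" concrete, the following is a minimal sketch of how such a comparison could be computed. It assumes Pearson's r as the metric, which the summary does not specify, and the numbers are toy values invented for illustration, not MISATO results. The function name `pearson_r` is hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values for illustration only (assumption): e.g. experimental vs.
# model-predicted binding affinities on some common scale.
experimental = [5.1, 6.3, 7.0, 8.2, 4.4]
predicted = [5.4, 6.0, 7.3, 7.9, 4.9]

r = pearson_r(experimental, predicted)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates that predictions rise and fall with the experimental values. It does not by itself show that the predictions are unbiased, so an error measure such as RMSE is usually reported alongside it.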
The dataset is publicly accessible and can be downloaded from Zenodo. The code is available from the GitHub repository and on Zenodo. The dataset is accessible via a Python interface using a simple PyTorch data loader. Special attention was given to code modularity, making it easy to adjust the AI architecture. The dataset is built from the PDBbind database (release 2022). Source data are provided with this paper.
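The data-loader idea can be illustrated with a small sketch of the map-style dataset protocol (`__len__` and `__getitem__`) that loaders of this kind build on. Everything here is an assumption made for illustration: the record fields (`coords`, `affinity`), the complex identifiers, and the names `ComplexDataset` and `iter_batches` are hypothetical and do not reproduce the actual MISATO file layout or API. The sketch uses only the standard library so that it runs without PyTorch installed.

```python
import random
from typing import Dict, Iterator, List, Optional, Sequence

# Hypothetical in-memory stand-in for the dataset (assumption): each complex
# has a few atom coordinates and one scalar label.
TOY_COMPLEXES: Dict[str, Dict[str, object]] = {
    "cplx_a": {"coords": [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]], "affinity": 6.2},
    "cplx_b": {"coords": [[0.0, 1.0, 0.0]], "affinity": 4.8},
    "cplx_c": {"coords": [[2.0, 0.0, 1.0], [0.0, 0.0, 1.0]], "affinity": 7.5},
}


class ComplexDataset:
    """Map-style dataset: integer-indexable, with a known length."""

    def __init__(self, records: Dict[str, Dict[str, object]],
                 ids: Optional[Sequence[str]] = None) -> None:
        self.records = records
        # An explicit id list lets callers restrict to a train/test split.
        self.ids: List[str] = list(ids) if ids is not None else sorted(records)

    def __len__(self) -> int:
        return len(self.ids)

    def __getitem__(self, idx: int) -> Dict[str, object]:
        complex_id = self.ids[idx]
        record = self.records[complex_id]
        return {"id": complex_id,
                "coords": record["coords"],
                "affinity": record["affinity"]}


def iter_batches(dataset: ComplexDataset, batch_size: int,
                 shuffle: bool = False,
                 seed: int = 0) -> Iterator[List[Dict[str, object]]]:
    """Yield lists of samples; the last batch may be smaller."""
    order = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]


dataset = ComplexDataset(TOY_COMPLEXES)
for batch in iter_batches(dataset, batch_size=2):
    print([sample["id"] for sample in batch])
```

In a PyTorch setting, an object exposing these two methods can be passed to `torch.utils.data.DataLoader` in place of the hand-written `iter_batches`. Because complexes differ in atom count, a custom `collate_fn` would likely be needed to pad or otherwise combine samples into a batch.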