PLINDER: The protein-ligand interactions dataset and evaluation resource

July 19, 2024 | Janani Durairaj, Yusuf Adeshina, Zhonglin Cao, Xuejin Zhang, Vladas Oleinikovas, Thomas Duignan, Zachary McClure, Xavier Robin, Gabriel Studer, Daniel Kovtun, Emanuele Rossi, Guoqing Zhou, Srimukh Veccham, Clemens Isert, Yuxing Peng, Prabindh Sundareson, Mehmet Akdel, Gabriele Corso, Hannes Stärk, Gerardo Tauriello, Zachary Carpenter, Michael Bronstein, Emine Kucukbenli, Torsten Schwede, Luca Naef
PLINDER is a large, comprehensive dataset of protein-ligand interactions (PLI) designed to support the development and evaluation of deep learning methods in drug discovery and protein engineering. It contains 449,383 PLI systems curated from the Protein Data Bank (PDB), each with over 500 annotations, similarity metrics at multiple levels, and paired unbound (apo) and predicted structures. The systems span a wide range of complex types, including multi-ligand, oligonucleotide, peptide, and saccharide complexes. The annotations cover quality and domain information, and the similarity metrics make it possible to measure diversity and detect information leakage. Holo complexes are linked to relevant apo and predicted structures, which supports realistic inference scenarios.

PLINDER uses a splitting algorithm that minimizes task-specific leakage between train and test sets while maximizing test set quality, so that the test set is diverse and representative. Several splits are provided, such as PLINDER-PL50, PLINDER-ECOD, and PLINDER-TIME, each designed to address different aspects of data leakage and test set quality.

The dataset is used to evaluate deep learning models, such as DiffDock, on tasks including rigid body docking, flexible pocket docking, co-folding, and ligand-conditioned protein engineering. The results show that test set quality significantly affects model performance, with high-quality test sets leading to more accurate predictions. PLINDER also provides a benchmark for evaluating the effectiveness of different splitting strategies and the importance of similarity metrics in reducing leakage.

The dataset is publicly available under a CC-BY 4.0 license.
It is designed to support the development of new methods for protein-ligand interaction prediction and to provide a reliable resource for researchers in the field. PLINDER aims to advance the development of novel drug discovery and protein engineering approaches by providing a robust and reliable dataset for training and evaluating deep learning-based prediction methods.
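
The idea of a leakage-aware split described above can be illustrated with a small sketch. This is not PLINDER's actual splitting algorithm: it assumes a single hypothetical pairwise similarity score per pair of systems, a made-up threshold of 0.5, and invented system IDs. It clusters systems whose similarity meets the threshold, assigns whole clusters to either train or test, and then reports the fraction of test systems that still have a similar partner in train.

```python
from collections import defaultdict

def cluster_by_similarity(ids, similarity, threshold):
    """Group systems so that any pair scoring >= threshold shares a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    # Largest clusters first; ties broken alphabetically for determinism.
    return sorted((sorted(m) for m in clusters.values()),
                  key=lambda c: (-len(c), c))

def split_by_cluster(clusters, test_fraction):
    """Move whole clusters into test, smallest first, within a size budget."""
    budget = test_fraction * sum(len(c) for c in clusters)
    train, test = [], []
    for cluster in sorted(clusters, key=lambda c: (len(c), c)):
        if len(test) + len(cluster) <= budget:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return sorted(train), sorted(test)

def leaked_fraction(train, test, similarity, threshold):
    """Fraction of test systems with at least one similar train system."""
    train_set, test_set = set(train), set(test)
    leaked = set()
    for (a, b), score in similarity.items():
        if score >= threshold:
            if a in test_set and b in train_set:
                leaked.add(a)
            if b in test_set and a in train_set:
                leaked.add(b)
    return len(leaked) / len(test) if test else 0.0

# Hypothetical toy data: IDs and scores are invented for illustration only.
ids = ["sys_a", "sys_b", "sys_c", "sys_d", "sys_e", "sys_f"]
similarity = {
    ("sys_a", "sys_b"): 0.9,
    ("sys_b", "sys_c"): 0.7,
    ("sys_d", "sys_e"): 0.8,
    ("sys_a", "sys_f"): 0.2,
    ("sys_c", "sys_d"): 0.3,
}
THRESHOLD = 0.5  # assumed cutoff, not a PLINDER parameter

clusters = cluster_by_similarity(ids, similarity, THRESHOLD)
train, test = split_by_cluster(clusters, test_fraction=0.5)
clustered_leak = leaked_fraction(train, test, similarity, THRESHOLD)

# A naive split that ignores similarity, for comparison.
naive_train, naive_test = ["sys_a", "sys_c", "sys_d"], ["sys_b", "sys_e", "sys_f"]
naive_leak = leaked_fraction(naive_train, naive_test, similarity, THRESHOLD)

print(f"cluster split leakage: {clustered_leak:.2f}")
print(f"naive split leakage:   {naive_leak:.2f}")
```

In this toy setting the cluster-based split keeps similar systems on the same side, while the naive split leaves similar pairs straddling train and test. A real resource would combine several similarity metrics at different levels, as the summary notes, rather than a single score.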