Understanding PLINDER: The protein-ligand interactions dataset and evaluation resource

PLINDER is a large, comprehensively annotated dataset of protein-ligand interactions (PLI), presented as the largest of its kind. It comprises 449,383 PLI systems, each with more than 500 annotations. The dataset covers several types of PLI systems, including multi-ligand systems, oligonucleotides, peptides, and saccharides. PLINDER calculates similarity metrics at the protein, pocket, PLI, and ligand levels, which makes it possible to measure diversity and to detect information leakage between splits. It also provides quality and domain information for complexes and links *holo* complexes to relevant *apo* and predicted structures. The splitting algorithm is designed to produce a diverse training set and a high-quality test set while minimizing task-specific leakage. DiffDock, a deep learning-based method, is evaluated on different splits of PLINDER, and the results demonstrate the importance of training set size and diversity for model accuracy. The dataset and associated code are available for public use, with the aim of advancing the field of protein-ligand interaction prediction.
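
To make the idea of similarity-based leakage detection concrete, here is a minimal sketch in Python. It is not PLINDER's code or API: the function name `find_leaky_test_systems`, the pairwise-score dictionary, the identifiers, and the 0.5 threshold are all hypothetical, chosen only for illustration. The assumption is that a test system counts as "leaky" when its most similar training system scores at or above a chosen threshold on some similarity metric (for example, a pocket-level or ligand-level score).

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (test_id, train_id); hypothetical identifiers


def find_leaky_test_systems(
    similarity: Dict[Pair, float],
    train_ids: Iterable[str],
    test_ids: Iterable[str],
    threshold: float,
) -> Dict[str, Tuple[str, float]]:
    """Return test systems whose closest training system meets the threshold.

    Pairs missing from `similarity` are treated as having a score of 0.0
    (an assumption of this sketch, not a property of any real dataset).
    """
    train = list(train_ids)
    leaky: Dict[str, Tuple[str, float]] = {}
    for test_id in test_ids:
        best_id, best_score = None, float("-inf")
        for train_id in train:
            score = similarity.get((test_id, train_id), 0.0)
            if score > best_score:
                best_id, best_score = train_id, score
        # Flag the test system if its nearest training neighbour is too similar.
        if best_id is not None and best_score >= threshold:
            leaky[test_id] = (best_id, best_score)
    return leaky


# Toy scores in [0, 1]; these values are made up for the example.
toy_similarity = {
    ("test_a", "train_1"): 0.92,
    ("test_a", "train_2"): 0.10,
    ("test_b", "train_1"): 0.25,
}
leaky = find_leaky_test_systems(
    toy_similarity,
    train_ids=["train_1", "train_2"],
    test_ids=["test_a", "test_b"],
    threshold=0.5,
)
print(leaky)
```

In this toy case only `test_a` is flagged, because its closest training system scores above the threshold. Running the same check separately for each similarity level (protein, pocket, PLI, ligand) would show which kind of leakage a given split is exposed to.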

PLINDER: The protein-ligand interactions dataset and evaluation resource