Prediction of blood–brain barrier penetrating peptides based on data augmentation with Augur

2024 | Zhi-Feng Gu, Yu-Duo Hao, Tian-Yu Wang, Pei-Ling Cai, Yang Zhang, Ke-Jun Deng, Hao Lin, Hao Lv
This study presents Augur, a prediction model that identifies blood–brain barrier (BBB) penetrating peptides (B3PPs) using data augmentation and machine learning. The model addresses two problems that have limited earlier predictors: scarce positive data and imbalanced datasets. Augur applies borderline-SMOTE-based data augmentation to enlarge the sample size and balance the dataset, and it extracts highly interpretable physicochemical properties of B3PPs.
The experimental results show that Augur reaches an AUC of 0.932 on the training set and 0.931 on the independent test set. The study also analyzes amino acid composition and compares different feature encoding and selection techniques. The optimal feature set, selected with the Information Gain (IG) method, consists of 383 features and significantly improves the model's performance. Among the machine learning algorithms evaluated, Random Forest (RF) performs best. The authors attribute Augur's predictive ability to its effective handling of imbalanced datasets and an optimal data augmentation ratio. These results have implications for drug development targeting neurological disorders: they could make peptide-based drug discovery more efficient and support new treatment strategies for central nervous system diseases.
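
The borderline-SMOTE augmentation mentioned in the summary can be illustrated with a small, standard-library-only sketch. This is not Augur's code: the summary does not give the variant, neighbour counts, or augmentation ratio, so the function names (`danger_indices`, `borderline_smote`), the parameters `m` and `k`, and the toy 2-D points below are all assumptions chosen for illustration. The sketch follows the commonly described Borderline-SMOTE1 idea: only minority samples whose neighbourhood is dominated (but not fully occupied) by majority samples are used as seeds, and new points are interpolated between a seed and one of its minority neighbours.

```python
import math
import random


def danger_indices(minority, majority, m=5):
    """Indices of minority points that sit on the class border.

    A point is 'in danger' when at least half, but not all, of its m
    nearest neighbours (over both classes) belong to the majority class.
    """
    everything = minority + majority
    is_majority = [False] * len(minority) + [True] * len(majority)
    danger = []
    for idx, point in enumerate(minority):
        others = [i for i in range(len(everything)) if i != idx]
        nearest = sorted(others, key=lambda i: math.dist(point, everything[i]))[:m]
        n_majority = sum(is_majority[i] for i in nearest)
        if m / 2 <= n_majority < m:
            danger.append(idx)
    return danger


def borderline_smote(minority, majority, n_new, m=5, k=3, seed=0):
    """Generate n_new synthetic minority points from borderline seeds.

    Returns an empty list when no borderline seed exists or when there
    are fewer than two minority points to interpolate between.
    """
    rng = random.Random(seed)
    danger = danger_indices(minority, majority, m)
    if not danger or len(minority) < 2:
        return []

    synthetic = []
    for _ in range(n_new):
        idx = rng.choice(danger)
        seed_point = minority[idx]
        # k nearest neighbours of the seed, restricted to the minority class
        mates = sorted(
            (i for i in range(len(minority)) if i != idx),
            key=lambda i: math.dist(seed_point, minority[i]),
        )[:k]
        mate = minority[rng.choice(mates)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(seed_point, mate))
        )
    return synthetic


# Toy data (hypothetical): the minority class stands in for B3PPs,
# the majority class for non-penetrating peptides.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (1.0, 1.0), (1.1, 0.9)]
majority = [(1.2, 1.1), (1.3, 0.9), (1.0, 1.3), (2.0, 2.0),
            (2.2, 1.8), (2.1, 2.2), (3.0, 3.0)]

synthetic = borderline_smote(minority, majority, n_new=6, m=3, k=2, seed=0)
print(len(synthetic), "synthetic minority points generated")
```

In practice the feature vectors would be the encoded peptide descriptors rather than 2-D points, and a maintained implementation such as the `BorderlineSMOTE` class in the imbalanced-learn package would normally be preferred over hand-written code.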