2024 | Zhi-Feng Gu, Yu-Duo Hao, Tian-Yu Wang, Pei-Ling Cai, Yang Zhang, Ke-Jun Deng, Hao Lin and Hao Lv
This study introduces Augur, a machine learning model for predicting blood-brain barrier (BBB) penetrating peptides (B3PPs) using data augmentation and feature selection. The model addresses the challenges of limited positive data and class imbalance by applying borderline-SMOTE oversampling and random under-sampling. It extracts interpretable physicochemical features of B3PPs and combines them with machine learning algorithms to improve prediction accuracy. The model was evaluated on a benchmark dataset of 269 B3PP and 2690 non-B3PP sequences. Augur achieved an AUC of 0.932 on the training set and 0.931 on the independent test set, outperforming existing prediction models such as B3Pred, MIMML, and SCMB3PP. Performance was further improved by using information gain (IG) for feature selection and choosing the optimal number of features through ternary search. The study also compared several machine learning algorithms, namely random forest (RF), LightGBM, logistic regression (LR), support vector machine (SVM), and k-nearest neighbors (KNN), and found that LightGBM and RF performed best. The results indicate that Augur provides a more accurate and efficient method for predicting B3PPs, which is crucial for developing drugs targeting neurological disorders. The model's success is attributed to its ability to handle imbalanced data and to extract meaningful features that contribute to BBB penetration. This advance may improve the efficiency of peptide-based drug discovery and pave the way for new treatment strategies for central nervous system diseases.
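The feature-selection step described above (rank features by information gain, then pick the number of top-ranked features by ternary search) can be illustrated with a minimal sketch. This is not Augur's actual code: the function names are hypothetical, the feature values are assumed to be already discretized, and the score as a function of the feature count is assumed to be unimodal (rising to a single peak, then falling), which is the condition under which ternary search is valid.

```python
import math
from collections import Counter
from typing import Callable, Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """IG = H(labels) - H(labels | feature), for a discretized feature (assumption)."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(features: Dict[str, Sequence[Hashable]],
                  labels: Sequence[Hashable]) -> List[str]:
    """Return feature names sorted by information gain, highest first."""
    return sorted(features,
                  key=lambda name: information_gain(features[name], labels),
                  reverse=True)


def ternary_search_best_k(score: Callable[[int], float], lo: int, hi: int) -> int:
    """Return the k in [lo, hi] that maximizes score(k), assuming score is unimodal.

    Each iteration discards roughly one third of the interval, so score is
    evaluated O(log(hi - lo)) times instead of once per candidate k.
    """
    cache: Dict[int, float] = {}

    def cached(k: int) -> float:
        # Avoid re-evaluating an expensive score (e.g. a cross-validation run).
        if k not in cache:
            cache[k] = score(k)
        return cache[k]

    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if cached(m1) < cached(m2):
            lo = m1 + 1   # the peak lies to the right of m1
        else:
            hi = m2 - 1   # the peak lies to the left of m2
    return max(range(lo, hi + 1), key=cached)


if __name__ == "__main__":
    # Toy data: feature "b" separates the classes perfectly, "a" not at all.
    labels = [1, 1, 0, 0]
    features = {"a": [1, 0, 1, 0], "b": [1, 1, 0, 0]}
    print(rank_features(features, labels))
    # Toy stand-in for "validation score of a model trained on the top-k features".
    print(ternary_search_best_k(lambda k: -(k - 37) ** 2, 1, 200))
```

In a real pipeline, `score(k)` would typically be a cross-validated metric such as AUC for a classifier trained on the top `k` ranked features. Because such scores are noisy and only approximately unimodal, ternary search may settle on a local optimum; it trades that risk for far fewer model fits than an exhaustive scan over every `k`.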