Support vector machine approach for protein subcellular localization prediction

Received on December 12, 2000; revised on March 28, 2001; accepted on April 24, 2001 | Sujun Hua and Zhirong Sun
This paper presents a Support Vector Machine (SVM) approach for predicting the subcellular localization of proteins from their amino acid compositions. The method achieves a total prediction accuracy of 91.4% for three subcellular locations in prokaryotic organisms and 79.4% for four locations in eukaryotic organisms. It is robust to errors in the protein N-terminal sequences and outperforms existing algorithms based on amino acid composition. A web server implementing the method is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/.
The study compares the SVM method with other prediction methods, including neural networks, the covariant discriminant algorithm, and the Markov chain model. For prokaryotic sequences, the total accuracy is about 10% higher than that of the neural network method and about 5% higher than that of the covariant discriminant algorithm. For eukaryotic sequences, the total accuracy is 13% higher than that of the neural network method. Compared with the Markov chain model, the SVM method achieves 2.3% and 6.4% higher accuracy for prokaryotic and eukaryotic sequences, respectively.
The SVM method is also robust to errors in the N-terminal sequence annotation. Even when N-terminal segments of up to 40 amino acids were removed, the total accuracy fell by only 1.2% for prokaryotic sequences and 3% for eukaryotic sequences. This indicates that the SVM method is more reliable than methods based on sorting signals when the N-terminal sequence is incomplete.
The SVM method condenses the information in the training samples into a sparse representation that uses only a small number of samples, the Support Vectors (SVs). Because the number of SVs is small relative to the total number of training samples, the method is efficient and suitable for large datasets. The SVM method can also incorporate other informative features, which may improve the prediction accuracy.
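As a rough illustration of the input representation described above, the sketch below computes an amino acid composition vector and repeats the computation after removing an N-terminal segment. It is not the authors' code: the function names, the residue ordering, and the toy sequence are assumptions made for this example, and the classifier itself is not shown.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
# The ordering is an arbitrary choice made for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid as a 20-element list."""
    counts = Counter(sequence.upper())
    # Non-standard symbols (e.g. X) are ignored when normalizing.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def truncate_n_terminus(sequence, n_removed):
    """Drop the first n_removed residues to mimic a missing N-terminal segment."""
    return sequence[n_removed:]


# Hypothetical 40-residue sequence, invented for illustration only.
seq = "MKRLLSTAVL" + "AALSGSAFAA" + "DEKTVEIGGK" + "WYRDNPQHCF"

full = amino_acid_composition(seq)
for n_removed in (10, 20):
    truncated = amino_acid_composition(truncate_n_terminus(seq, n_removed))
    # Largest change in any single residue fraction after truncation.
    shift = max(abs(a - b) for a, b in zip(full, truncated))
    print(n_removed, round(shift, 3))
```

Each protein is reduced to a fixed-length vector of 20 fractions regardless of its length, which is the kind of input an SVM can take directly. Because the features are averaged over the whole chain, removing a short N-terminal segment from a full-length protein changes them only modestly, whereas a method that reads a sorting signal from that segment loses its input entirely. The toy sequence here is much shorter than a typical protein, so the shifts it prints will overstate the effect.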
In conclusion, the SVM approach provides superior prediction performance compared with existing algorithms based on amino acid composition and can be a complementary method to other existing methods based on sorting signals. The method is robust to errors in the N-terminal sequence and is a useful tool for the large-scale analysis of genome data.