SignalP 5.0 improves signal peptide predictions using deep neural networks

SignalP 5.0 is a deep neural network-based method that improves the prediction of signal peptides (SPs) in proteins. It combines a deep recurrent neural network (RNN) architecture with a conditional random field (CRF) classification layer and optimized transfer learning. Compared with traditional feed-forward neural networks, this architecture is better at recognizing sequence motifs such as SPs. The CRF imposes a defined grammar on the prediction, which removes the need for the post-processing steps used in earlier versions of SignalP. Transfer learning lets the model perform well even on small datasets, such as the one for archaeal sequences.

SignalP 5.0 distinguishes three types of prokaryotic SPs: Sec substrates cleaved by SPase I (Sec/SPI), Sec substrates cleaved by SPase II (Sec/SPII), and Tat substrates cleaved by SPase I (Tat/SPI). It cannot identify Tat substrates cleaved by SPase II, although these are known to exist, and it cannot identify Sec substrates processed by SPase III because there is not enough training data for them.

SignalP 5.0 was trained and tested on four groups of organisms (Eukaryotes, Archaea, Gram-positive bacteria, and Gram-negative bacteria) and four types of proteins. The training data consisted of 20,758 proteins. The model was benchmarked against 18 SP prediction algorithms; one of them, Signal-BLAST, was excluded from the comparison because its performance was artificially high. SignalP 5.0 achieved a Matthews Correlation Coefficient (MCC) of 0.938 for Archaea, 0.907 for Gram-negative bacteria, 0.890 for Gram-positive bacteria, and 0.966 for Eukaryotes. It outperformed other methods in several benchmarks, particularly the Sec/SPI benchmark, where for Gram-positive bacteria it ranked second, behind only SignalP 4.1. It also had the highest cleavage site (CS) recall in Eukaryotes and Bacteria, and the second highest CS recall in Archaea.
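For readers unfamiliar with the MCC figures quoted above, the following is a minimal sketch of how the coefficient is computed from a binary confusion matrix (SP versus no SP). The function name and the example counts are hypothetical and are not taken from the paper; returning 0.0 for an undefined denominator is a common convention, assumed here.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only; not results from the paper.
print(round(matthews_corrcoef(tp=90, tn=880, fp=10, fn=20), 3))
```

The MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect prediction), and it stays informative when proteins without an SP far outnumber those with one.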
The model achieved the highest CS precision across all organisms compared to existing methods. SignalP 5.0 can predict proteome-wide SPs across all organisms and classify them into Sec/SPI, Sec/SPII, and Tat/SPI SPs, often better than specialized predictors. The model was tested on two well-annotated reference proteomes, Escherichia coli and Saccharomyces cerevisiae, and accurately detected all but one of the experimentally verified Sec/SPI SPs. The model also identified potentially new SPs with high probability, which may be interesting candidates for verification.
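The summary's point that a CRF imposes a grammar on the prediction, so that no post-processing is needed, can be illustrated with a toy decoder. This is a sketch under stated assumptions, not the SignalP 5.0 implementation: the two labels (`SP`, `MAT`), the transition table, and the emission scores are all hypothetical, and the real model uses its own label set and learned scores. The only idea shown is that forbidden transitions are excluded during decoding, which choosing the best label independently at each position cannot guarantee.

```python
import math

# Toy labels (assumption): SP = signal peptide residue, MAT = mature protein residue.
STATES = ["SP", "MAT"]
NEG_INF = -math.inf

# Hypothetical transition scores. The "grammar": once the mature protein
# has started, the path may never return to the signal peptide.
TRANS = {
    ("SP", "SP"): 0.0,
    ("SP", "MAT"): 0.0,
    ("MAT", "MAT"): 0.0,
    ("MAT", "SP"): NEG_INF,
}


def viterbi(emissions):
    """Return the highest-scoring label path that respects TRANS.

    emissions: one dict per sequence position, mapping state -> score.
    """
    best = [dict(emissions[0])]  # best score of any path ending in each state
    back = []                    # back-pointers for positions 1..n-1
    for scores in emissions[1:]:
        col, ptr = {}, {}
        for cur in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + TRANS[(p, cur)])
            col[cur] = best[-1][prev] + TRANS[(prev, cur)] + scores[cur]
            ptr[cur] = prev
        best.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def per_position_argmax(emissions):
    """Pick the best label at each position independently (no grammar)."""
    return [max(STATES, key=lambda s: e[s]) for e in emissions]


# Hypothetical per-residue scores with one noisy position (index 3).
emissions = [
    {"SP": 2.0, "MAT": 0.0},
    {"SP": 1.5, "MAT": 0.0},
    {"SP": 0.0, "MAT": 1.0},
    {"SP": 0.4, "MAT": 0.0},
    {"SP": 0.0, "MAT": 2.0},
]

print(per_position_argmax(emissions))  # ['SP', 'SP', 'MAT', 'SP', 'MAT']
print(viterbi(emissions))              # ['SP', 'SP', 'MAT', 'MAT', 'MAT']
```

The independent per-position choice yields a path that re-enters the signal peptide after the mature protein has begun, which would need a clean-up step. The constrained decoder returns a valid path directly, with a single SP-to-MAT boundary that plays the role of a predicted cleavage site in this toy setting.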

SignalP 5.0 improves signal peptide predictions using deep neural networks

2019 | Armenteros, José Juan Almagro; Tsirigos, Konstantinos; Sønderby, Casper Kaae; Petersen, Thomas Nordahl; Winther, Ole; Brunak, Søren; von Heijne, Gunnar; Nielsen, Henrik