CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model

2013 | Liguo Wang, Hyun Jung Park, Surendra Dasari, Shengqin Wang, Jean-Pierre Kocher, and Wei Li
CPAT is an alignment-free method for assessing the coding potential of transcripts. It uses a logistic regression model built on four sequence features: open reading frame (ORF) size, ORF coverage, the Fickett TESTCODE statistic, and hexamer usage bias. Compared with the alignment-based tools Coding-Potential Calculator (CPC) and Phylo Codon Substitution Frequencies (PhyloCSF), CPAT offers a better balance of sensitivity and specificity: 0.96 and 0.97 for CPAT, versus 0.99 and 0.74 for CPC and 0.90 and 0.63 for PhyloCSF. CPAT is also reported to be approximately 10,000 times faster than CPC and PhyloCSF. The software accepts input sequences in FASTA or BED format, and a web interface lets users submit sequences and receive predictions immediately.

The method is particularly useful for distinguishing coding from noncoding transcripts, especially long noncoding RNAs (lncRNAs), which are often lineage-specific and poorly conserved. Alignment-based methods struggle to classify lncRNAs accurately because of this low conservation and because lncRNAs can overlap protein-coding genes. Of the four features, the Fickett score is described as the most accurate, achieving 94% sensitivity and 97% specificity when the test region is at least 200 nt long. The hexamer score, based on the log-likelihood ratio of hexamer usage between coding and noncoding sequences, is also a strong predictor.

CPAT was trained on 10,000 coding and 10,000 noncoding transcripts and evaluated on a test set of 8,000 genes (4,000 coding and 4,000 noncoding). The model achieved an AUC of 0.9927, and a score threshold of 0.364 gave the best combination of sensitivity and specificity. CPAT outperformed CPC, PhyloCSF, and PORTRAIT on the combination of sensitivity and specificity, with the highest overall accuracy.
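To make three of the four features concrete, here is a minimal Python sketch of ORF size, ORF coverage, and a hexamer log-likelihood score. It is an illustration under stated assumptions, not CPAT's actual implementation: the function names, the forward-strand ATG-to-stop ORF search, the in-frame hexamer stepping, and the pseudocount are all choices made for this example.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Assumption for this sketch: only the three forward frames are scanned
    and an ORF must end in a stop codon. Returns (0, 0) if none is found.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None
    return best


def orf_features(seq):
    """ORF size in nt and ORF coverage (ORF size / transcript length)."""
    start, end = longest_orf(seq)
    size = end - start
    coverage = size / len(seq) if seq else 0.0
    return size, coverage


def hexamer_table(seqs, pseudocount=1.0):
    """Build a smoothed in-frame hexamer frequency lookup from sequences.

    The pseudocount (an assumption of this sketch) keeps unseen hexamers
    from producing a zero frequency.
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    denom = sum(counts.values()) + pseudocount * 4 ** 6
    return lambda h: (counts[h] + pseudocount) / denom


def hexamer_score(seq, coding_freq, noncoding_freq):
    """Mean log-likelihood ratio of hexamer usage, coding vs. noncoding."""
    seq = seq.upper()
    hexamers = [seq[i:i + 6] for i in range(0, len(seq) - 5, 3)]
    if not hexamers:
        return 0.0
    total = sum(math.log(coding_freq(h) / noncoding_freq(h)) for h in hexamers)
    return total / len(hexamers)
```

A positive hexamer score means the sequence's hexamers are more typical of the coding background than of the noncoding one; a negative score means the opposite. In practice the two background tables would be built from large sets of known coding and noncoding transcripts.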
CPAT is also computationally efficient: it processed 200 sequences in 0.67 seconds, far faster than CPC and PhyloCSF. This combination of accuracy and speed makes it well suited to analyzing large transcriptomes. It is freely available and accessible through a web interface. Because the approach is alignment-free, it can handle a broader range of transcripts, including those that are poorly conserved or difficult to align.
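The logistic regression step and the 0.364 cutoff mentioned in this summary can be sketched as follows. The coefficients below are hypothetical placeholders chosen for illustration; they are not CPAT's fitted values, and the function names are assumptions of this example. Only the threshold of 0.364 comes from the summary.

```python
import math

# Hypothetical coefficients for illustration only; NOT CPAT's fitted model.
COEFFS = {
    "intercept": -3.0,
    "orf_size": 0.005,
    "orf_coverage": 2.0,
    "fickett": 1.5,
    "hexamer": 4.0,
}
THRESHOLD = 0.364  # score cutoff quoted in the summary


def coding_probability(orf_size, orf_coverage, fickett, hexamer, coeffs=COEFFS):
    """Logistic model: p = 1 / (1 + exp(-z)), with z linear in the features."""
    z = (coeffs["intercept"]
         + coeffs["orf_size"] * orf_size
         + coeffs["orf_coverage"] * orf_coverage
         + coeffs["fickett"] * fickett
         + coeffs["hexamer"] * hexamer)
    return 1.0 / (1.0 + math.exp(-z))


def classify(p, threshold=THRESHOLD):
    """Label a transcript as coding when its probability meets the cutoff."""
    return "coding" if p >= threshold else "noncoding"


def sensitivity_specificity(probs, labels, threshold=THRESHOLD):
    """Compute (sensitivity, specificity); labels are 1 = coding, 0 = noncoding."""
    tp = fn = tn = fp = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec
```

Sweeping the threshold over a labeled test set and recomputing sensitivity and specificity at each value is one way a cutoff such as 0.364 could be selected, for example by picking the point where the two curves cross.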