PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions

Vol. 27 ISMB 2011 | Michael F. Lin, Irwin Jungreis, Manolis Kellis
PhyloCSF is a comparative genomics method for distinguishing protein-coding from non-coding regions in genomic sequences. Given a multispecies nucleotide sequence alignment, it determines whether the alignment is likely to represent a conserved protein-coding region by statistically comparing phylogenetic codon models. The authors show that PhyloCSF outperforms other methods, including the previously benchmarked Codon Substitution Frequencies (CSF) metric, in classifying regions in 12-species Drosophila genome alignments.
PhyloCSF uses empirical codon models (ECMs) with thousands of parameters to model the rates of codon substitutions in coding and non-coding regions, which gives it more detailed and informative features than the standard dN/dS test. The method also uses genome-wide training data to estimate branch lengths and codon frequencies, which improves its accuracy. Performance is further improved by a length-based transformation of the log-likelihood ratio score, which accounts for the non-independence of codon sites in alignments. The software is freely available and has been applied to several species, including fission yeast, Drosophila, and mammals, where it has proved useful for transcriptome annotation and for identifying novel coding genes and long non-coding RNAs.
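The idea of scoring an alignment by a log-likelihood ratio between a coding and a non-coding codon model can be illustrated with a deliberately simplified sketch. It is not the PhyloCSF implementation: it compares only two aligned sequences with no phylogeny or branch lengths, and instead of empirical codon models it uses four substitution categories whose probabilities (`CODING`, `NONCODING`) are invented for illustration. The function names `classify` and `toy_llr` are likewise hypothetical.

```python
import math

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# ASSUMPTION: made-up category probabilities, not trained parameters.
# A coding region tolerates synonymous changes and rarely shows stop codons;
# a non-coding region has no such preference.
CODING = {"identical": 0.70, "synonymous": 0.20, "nonsynonymous": 0.099, "stop": 0.001}
NONCODING = {"identical": 0.70, "synonymous": 0.07, "nonsynonymous": 0.20, "stop": 0.03}


def classify(codon1, codon2):
    """Assign an aligned codon pair to one of four substitution categories."""
    aa1, aa2 = CODON_TABLE[codon1], CODON_TABLE[codon2]
    if aa1 == "*" or aa2 == "*":
        return "stop"
    if codon1 == codon2:
        return "identical"
    if aa1 == aa2:
        return "synonymous"
    return "nonsynonymous"


def toy_llr(seq1, seq2):
    """Sum log(P_coding / P_noncoding) over the aligned codon pairs."""
    seq1, seq2 = seq1.upper(), seq2.upper()
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be in-frame and of equal length")
    score = 0.0
    for i in range(0, len(seq1), 3):
        category = classify(seq1[i:i + 3], seq2[i:i + 3])
        score += math.log(CODING[category] / NONCODING[category])
    return score


if __name__ == "__main__":
    # Only synonymous differences: the score is positive (coding-like).
    print(toy_llr("ATGGCTGAAAAA", "ATGGCCGAGAAG"))
    # Nonsynonymous changes and an in-frame stop: the score is negative.
    print(toy_llr("ATGGCTGAAAAA", "ATGGATTAAACA"))
```

A positive score means the observed substitutions are better explained by the toy coding model, a negative score by the toy non-coding model. The sketch sums per-codon terms as if codon sites were independent, which is the simplification that the length-based transformation described above is meant to correct for.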