2011 | Michael F. Lin, Irwin Jungreis, Manolis Kellis
PhyloCSF is a comparative genomics method for distinguishing protein-coding from non-coding regions. Given a multispecies nucleotide sequence alignment, it applies a formal statistical comparison of phylogenetic codon models to determine whether the alignment is likely to represent a conserved protein-coding region. It can assess the coding potential of transcript models or individual exons in any assembled genome that can be aligned to one or more informant genomes at appropriate phylogenetic distances.

The method rests on a phylogenetic framework and produces meaningful likelihood ratios as its output. It uses empirical codon models (ECMs), based on several thousand parameters, to model the rates of codon substitutions in coding and non-coding regions. It also draws on genome-wide training data for prior information about the branch lengths of the phylogenetic tree and the codon frequencies.

In 12-species Drosophila genome alignments, PhyloCSF outperforms previous methods in classification performance, and its advantage in separating coding from non-coding regions is most pronounced for short exons. It is computationally demanding, but more theoretically sound than previous methods. PhyloCSF has been applied in fungi, flies, and mammals to identify novel coding genes and to evaluate the coding potential of transcripts. It is a valuable tool for genome annotation based on mRNA-Seq data and can contribute to new strategies for identifying protein-coding and non-coding regions. The method is implemented in Objective Caml and is freely available for GNU/Linux and Mac OS X.
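
The idea of scoring an alignment by a likelihood ratio between a coding and a non-coding codon model can be illustrated with a deliberately simplified sketch. This is not PhyloCSF's implementation. It assumes only two ungapped, in-frame sequences rather than a phylogeny, and it replaces the trained ECMs with hand-picked weights (`coding_weight`) and a fixed per-site divergence `d`, all of which are invented for illustration. The function name `toy_llr` is likewise hypothetical.

```python
import math
from itertools import product

# Standard genetic code in the conventional TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODE = dict(zip(CODONS, AMINO))


def coding_weight(a, b):
    """Unnormalized weight of an aligned codon pair under the toy coding model.

    The numbers are illustrative assumptions, not trained parameters:
    stop codons are almost forbidden, and synonymous changes are
    favored over nonsynonymous ones.
    """
    if CODE[a] == "*" or CODE[b] == "*":
        return 0.01
    if a == b:
        return 200.0
    if CODE[a] == CODE[b]:
        return 10.0
    return 1.0


# Normalizing constant so the coding weights form a joint distribution
# over all 64 x 64 codon pairs.
Z = sum(coding_weight(a, b) for a in CODONS for b in CODONS)


def log_p_coding(a, b):
    return math.log(coding_weight(a, b) / Z)


def log_p_noncoding(a, b, d=0.1):
    """Toy non-coding model: the three nucleotide sites evolve independently.

    Each site is uniform over the four bases and differs between the two
    sequences with probability d (an assumed divergence).
    """
    logp = 0.0
    for x, y in zip(a, b):
        logp += math.log(0.25 * ((1.0 - d) if x == y else d / 3.0))
    return logp


def split_codons(seq):
    """Split an in-frame sequence into codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def toy_llr(seq1, seq2):
    """Log-likelihood ratio (in nats) of the coding over the non-coding toy model.

    Expects two aligned, ungapped, uppercase sequences over A, C, G, T.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        log_p_coding(a, b) - log_p_noncoding(a, b)
        for a, b in zip(split_codons(seq1), split_codons(seq2))
    )


reference = "ATGGCTGAAGGTCTG"   # M A E G L
synonymous = "ATGGCCGAGGGACTA"  # same amino acids, four silent changes
missense = "ATGGATGCAGCTCCG"    # four amino-acid-changing substitutions

print(toy_llr(reference, synonymous))
print(toy_llr(reference, missense))
```

Under these assumed weights, the pair that differs only by synonymous changes receives a positive score and the pair with amino-acid-changing substitutions a negative one. The method itself evaluates likelihoods over a full phylogenetic tree with trained models, which this two-sequence sketch does not attempt.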