2009 | Albert J. Vilella, Jessica Severin, Abel Ureta-Vidal, Li Heng, Richard Durbin, and Ewan Birney
EnsemblCompara GeneTrees is a comprehensive gene-oriented phylogenetic resource that provides complete, duplication-aware phylogenetic trees for vertebrates. It is built using a computational pipeline that handles clustering, multiple alignment, and tree generation, including for large gene families. The paper introduces two novel non-sequence-based metrics for assessing gene tree correctness and uses them to benchmark several tree-building methods; the TreeBeST method from TreeFam performs best in this study. The phylogenetic approach is also compared to clustering approaches for ortholog prediction and shows improved coverage.
The resource provides phylogenetic trees for 91% of vertebrate genes, with non-vertebrate species included as outgroups. It provides data in multiple formats and is updated with the Ensembl project. The paper discusses the motivation, implementation, and benchmarking of the method, as well as display and access methods for the trees.
The paper compares different methods for generating orthology descriptions, including Inparanoid, MSOAR, OrthoMCL, HomoloGene, TreeFam, PhyOP, and PhiGs. TreeFam provides an explicit gene tree across multiple species using various measures, including dS, dN, nucleotide and protein distance, and a species tree to balance duplications and deletions. The Ensembl gene trees use TreeBeST, which integrates multiple tree topologies and penalizes topologies inconsistent with known species relationships.
The paper evaluates the performance of TreeBeST and PhyML in vertebrates, comparing them to basic best reciprocal hit (BRH) methods and cluster frameworks. It also benchmarks against a recent PhyOP dataset. The results show that TreeBeST produces trees that are more consistent with synteny relationships and contain fewer anomalous topologies than single protein-based methods.
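The best reciprocal hit baseline mentioned above can be illustrated with a minimal sketch. It assumes similarity scores are already available as dictionaries keyed by (query, target) pairs; the function names, gene labels, and scores are hypothetical and are not taken from the paper or from any BLAST tool.

```python
from typing import Dict, Set, Tuple

Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """For each query gene, keep the single highest-scoring target."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def best_reciprocal_hits(a_vs_b: Scores, b_vs_a: Scores) -> Set[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


# Hypothetical scores: H1/H2 are genes in one species, M1/M2 in another.
human_vs_mouse = {("H1", "M1"): 200.0, ("H1", "M2"): 150.0,
                  ("H2", "M1"): 120.0, ("H2", "M2"): 80.0}
mouse_vs_human = {("M1", "H1"): 200.0, ("M1", "H2"): 120.0,
                  ("M2", "H1"): 150.0, ("M2", "H2"): 80.0}

brh_pairs = best_reciprocal_hits(human_vs_mouse, mouse_vs_human)
print(brh_pairs)
```

In this toy input only one pair is reciprocal, so H2 and M2 receive no ortholog at all. This is one way a BRH approach can lose coverage after a duplication, which a tree-based method is designed to recover.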
The paper presents a robust, computationally efficient pipeline for gene tree generation, including steps for protein data sets, BLASTP all vs. all, graph construction, clustering, multiple alignments, gene tree and reconciliation, ortholog and paralog inference, and dN/dS calculations. The pipeline is fault-tolerant and allows for hierarchical breakdown of clusters to generate sensible trees for complex families.
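The graph construction and clustering steps listed above can be sketched as follows. This is an illustration only: it assumes that every retained BLASTP hit becomes an undirected edge and that clusters are simply the connected components of that graph. The function name and gene labels are hypothetical, and the actual pipeline may apply additional filtering and cluster-splitting rules.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_genes(edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group genes into connected components using union-find."""
    parent: Dict[str, str] = {}

    def find(gene: str) -> str:
        # Register unseen genes as their own root, then walk to the root
        # while shortening the path (path halving).
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters: Dict[str, List[str]] = defaultdict(list)
    for gene in list(parent):
        clusters[find(gene)].append(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical hit graph: g1-g2-g3 form one family, g4-g5 another.
hits = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
families = cluster_genes(hits)
print(families)
```

Each resulting cluster would then be passed on to multiple alignment and tree building. Very large components are where a hierarchical breakdown, as described above, becomes necessary.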
The paper also presents two metrics for assessing the quality of gene trees: the duplication consistency score and the gene synteny metric. The duplication consistency score measures how well the species found on either side of an inferred duplication event overlap, while the gene synteny metric assesses the conservation of gene order and orientation across species. The results show that TreeBeST performs better than the other methods on these metrics.
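A duplication consistency score of this kind can be sketched in a few lines. The sketch assumes that the score for a duplication node is the number of species present in both child subtrees divided by the number of species present in either one. That formula, the function name, and the species sets are assumptions made for illustration.

```python
from typing import Iterable


def duplication_consistency(left_species: Iterable[str],
                            right_species: Iterable[str]) -> float:
    """Assumed score: |shared species| / |all species| under a duplication node."""
    left, right = set(left_species), set(right_species)
    union = left | right
    if not union:
        raise ValueError("a duplication node needs at least one species")
    return len(left & right) / len(union)


# Both copies retained in every species: a well-supported duplication.
well_supported = duplication_consistency(
    {"human", "mouse", "dog"}, {"human", "mouse", "dog"})

# No species carries both copies: the inferred duplication is dubious.
dubious = duplication_consistency({"human"}, {"mouse"})

# One copy lost in dog after the duplication.
partial = duplication_consistency({"human", "mouse", "dog"}, {"human", "mouse"})

print(well_supported, dubious, partial)
```

Under this assumed definition, a score near zero flags duplication nodes that are more likely to come from tree-building error than from a real duplication, because a real duplication should leave both copies in at least some descendant species.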
The paper compares the EnsemblCompara GeneTrees with other orthology sets, showing better coverage and accuracy. It also discusses the use of the synteny metric to assess the specificity of different methods. The results show that EnsemblCompara performs better in terms of coverage and specificity for human and mouse genes.
The paper also discusses the display and access of orthologs, including web display, projection of GO terms via orthology links, and data mining using BioMart.