Understanding the Ensembl gene annotation system

The Ensembl gene annotation system is used to annotate over 70 vertebrate species and generates automatic alignment-based annotations for the human and mouse GENCODE gene sets. The system aligns biological sequences, including cDNAs, proteins, and RNA-seq reads, to the target genome to construct candidate transcript models. These models are carefully assessed and filtered to produce the final gene set, which is available on the Ensembl website.
The Ensembl gene annotation system is used for all vertebrate species in Ensembl. When providing gene annotation on a genome assembly, the main goal is to identify a set of full-length protein-coding genes. High accuracy is achieved by a well-established core data flow that integrates alignments of expressed protein, cDNA, and other biological sequences. Manual curation involves evaluating biological sequences aligned to the genome in order to support gene structures. The evidence for each gene structure is assessed by an individual trained in genome biology, resulting in low-throughput gene annotation that is especially valuable in biologically complex regions of the genome.
Ensembl's approach is to automate the decision-making steps followed by manual curators, as far as possible, using the same alignments. High-throughput annotation is achieved because thousands of genes can be annotated in parallel. The main strengths of the Ensembl annotation methods are the speed and consistency with which genome-wide annotation can be provided to the research community. These advantages will become ever more important as the number of assembled genomes and the amount of data available for each species increase due to new sequencing technologies.
The Ensembl gene annotation system described by Curwen et al. was designed to annotate species with high-quality draft genome assemblies, where same-species protein sequences and full-length cDNA sequences were available as input for identifying many of the protein-coding genes. More recently, fragmented genome assemblies have become available for annotation, as have assemblies with limited availability of same-species protein or full-length cDNA sequences. For many species, RNA-seq is an additional data source available for gene annotation. To address these new challenges, our system has been extended to include methods for fast and effective annotation of assemblies that are fragmented and for which there are relatively small amounts of same-species data. Novel methods have been developed to use data from new sequencing technologies and to improve accuracy for high-coverage genomes.
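The idea of filtering candidate transcript models automatically, locus by locus and in parallel, can be illustrated with a toy sketch. This is not the Ensembl implementation: the `CandidateModel` type, the `EVIDENCE_RANK` ordering, the coverage threshold, and the function names are all hypothetical and chosen only to make the data flow concrete.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical evidence ordering (lower = preferred); an assumption for
# illustration, not the ranking used by Ensembl.
EVIDENCE_RANK = {
    "same_species_cdna": 0,
    "same_species_protein": 1,
    "rnaseq": 2,
    "other_species_protein": 3,
}


@dataclass(frozen=True)
class CandidateModel:
    locus: str       # identifier of the genomic region
    name: str        # identifier of the candidate transcript model
    evidence: str    # key into EVIDENCE_RANK
    coverage: float  # fraction of the supporting sequence aligned (0..1)


def pick_best(models, min_coverage=0.8):
    """Drop poorly covered models, then prefer better evidence and coverage."""
    kept = [m for m in models if m.coverage >= min_coverage]
    if not kept:
        return None
    return min(kept, key=lambda m: (EVIDENCE_RANK[m.evidence], -m.coverage))


def build_gene_set(candidates, workers=4):
    """Group candidates by locus and filter each locus independently."""
    by_locus = defaultdict(list)
    for model in candidates:
        by_locus[model.locus].append(model)
    loci = sorted(by_locus)
    # Loci do not depend on each other, so they can be processed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        best = pool.map(pick_best, (by_locus[locus] for locus in loci))
        return {locus: m for locus, m in zip(loci, best) if m is not None}
```

Because each locus is evaluated independently with the same rule, the result is consistent across the genome, and throughput scales with the number of workers; a production system would distribute this work across a compute cluster rather than threads.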