The Ensembl genome database project

The Ensembl genome database project provides a comprehensive, stable, and automatic annotation of the human genome sequence. It integrates external data sources and is available as an interactive website or as flat files. It is an open-source software project designed to handle large genomes and their associated data, from sequence analysis through data storage to visualization. Ensembl is a leading source of human genome sequence annotation and contributed significantly to the analysis of the International Human Genome Project's draft genome. The Ensembl system is being installed worldwide at companies and academic sites, on machines ranging from supercomputers to laptops. Ensembl annotates known and predicted genes with functional annotations from InterPro, OMIM, SAGE, and gene family databases. Gene prediction is crucial for connecting DNA sequences with experimental data. Ensembl uses a combination of ab initio predictions, homology, and gene prediction HMMs to identify genes. The system determines gene structures in a three-step process that includes aligning known human proteins and paralogous proteins to the genome. Exons from predicted peptides are confirmed by BLAST matches to proteins, vertebrate mRNA, and UniGene clusters. Ensembl genes carry identifiers beginning ENSG, with transcripts starting ENST, exons ENSE, and translations ENSP. These identifiers are stable across genome assemblies. Ensembl continuously refines its gene building process, calibrating it against regions of the genome that have been hand annotated and experimentally investigated. EST data are integrated into Ensembl gene building, with strict quality measures to ensure accuracy. The Ensembl website provides interactive views of genomic sequences, including contigview, mapview, geneview, and proteinview. It allows users to search the entire human genome sequence, the predicted gene datasets, and the mouse genome trace and whole-genome assembly datasets using BLAST and SSAHA.
Ensembl can also be accessed via the Apollo Java viewer and the FTP site, providing various data download formats. Ensembl's software system is based on a relational database and reusable components, using Bioperl and MySQL. It is written primarily in Perl with extensions in C and Java. The system is portable and allows users to install it for their own genome data processing. Ensembl's data analysis pipeline handles the dynamic scale of the human genome sequence, reanalyzing the genome whenever a new assembly becomes available. The system also supports the DAS standard, enabling users to view and compare annotations from different sources. Ensembl is a joint project of the European Bioinformatics Institute and the Sanger Centre, with funding from the Wellcome Trust and EMBL.
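
The identifier scheme described above (ENSG for genes, ENST for transcripts, ENSE for exons, ENSP for translations) can be illustrated with a minimal Python sketch that maps an identifier to its feature type. This is not part of the Ensembl code base, which is written primarily in Perl. The function name `classify_stable_id` is hypothetical, the sketch assumes that each prefix is followed only by digits, and the identifiers in the usage lines are illustrative.

```python
import re

# Prefix-to-feature mapping taken from the description above.
PREFIX_TO_FEATURE = {
    "ENSG": "gene",
    "ENST": "transcript",
    "ENSE": "exon",
    "ENSP": "translation",
}

# Assumption: an identifier is one of the four prefixes followed only by digits.
_STABLE_ID = re.compile(r"^(ENSG|ENST|ENSE|ENSP)(\d+)$")


def classify_stable_id(identifier):
    """Return the feature type for an identifier, or None if it does not match."""
    match = _STABLE_ID.match(identifier.strip())
    if match is None:
        return None
    return PREFIX_TO_FEATURE[match.group(1)]


# Illustrative usage with made-up identifiers.
print(classify_stable_id("ENSG00000000001"))
print(classify_stable_id("ENSP00000000001"))
print(classify_stable_id("not-an-identifier"))
```

Keeping the mapping in a single dictionary means a script that mixes genes, transcripts, exons, and translations can branch on the feature type without repeating string comparisons.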