Received August 15, 2005; Revised and Accepted October 20, 2005 | Feng Chen, Aaron J. Mackey, Christian J. Stoeckert Jr and David S. Roos*
The OrthoMCL database (http://orthomcl.cbil.upenn.edu) is a comprehensive resource for ortholog group predictions across 55 species, including 16 bacterial and 4 archaeal genomes, as well as most complete eukaryotic genomes. The database clusters proteins based on sequence similarity using an all-against-all BLAST search, followed by normalization and Markov clustering. A total of 511,797 proteins (81.6% of the dataset) were clustered into 70,388 ortholog groups. Users can query the database based on protein or group accession numbers, keyword descriptions, or BLAST similarity. Ortholog groups with specific phyletic patterns can be identified using a graphical interface or a text-based Phyletic Pattern Expression grammar. The database provides detailed information for each ortholog group, including phyletic profiles, member protein lists, multiple sequence alignments, statistical summaries, and graphical representations of domain architecture. The OrthoMCL software, FASTA dataset, and clustering results are available for download, and the database will be updated and expanded as more genome sequence data become available.
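To illustrate the Markov clustering step named above, the following is a minimal sketch of the generic expansion/inflation iteration on a small similarity graph. It is not the OrthoMCL implementation: the function names, the inflation value of 2.0, the convergence thresholds, and the toy weights are all assumptions made for illustration, and real input would be normalized BLAST-derived edge weights rather than the unit weights used here.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def expand(m):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation step: raise entries to the power r, then renormalize."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a symmetric, non-negative weight matrix (hypothetical helper)."""
    n = len(weights)
    # Add self-loops, then convert to a column-stochastic matrix.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows with a non-zero diagonal act as attractors; the non-zero
    # columns of such a row are the members of its cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(sorted(c) for c in clusters)


# Toy graph (made-up weights): proteins 0-2 are mutually similar,
# as are proteins 3-4; there are no edges between the two sets.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]

print(mcl(similarity))  # [[0, 1, 2], [3, 4]]
```

In this sketch a higher inflation value tends to produce smaller, tighter clusters, which is the usual knob for tuning cluster granularity in Markov clustering.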