GMAP: a genomic mapping and alignment program for mRNA and EST sequences

February 22, 2005 | Thomas D. Wu and Colin K. Watanabe
GMAP is a standalone program for mapping and aligning cDNA sequences (mRNAs and ESTs) to a genome. It works with minimal startup time and memory requirements, which allows fast batch processing of large sequence sets. GMAP generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Its methodology comprises a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich dynamic programming (DP) for splice site detection, and microexon identification with statistical significance testing.

In the reported evaluations, GMAP was several-fold faster than existing programs. On a set of human messenger RNAs with random mutations introduced at rates of 1% and 3%, GMAP identified all splice sites accurately in over 99.3% of the sequences, an error rate one-tenth that of existing programs. On a large set of human expressed sequence tags, GMAP provided higher-quality alignments more often than BLAT did. On a set of Arabidopsis cDNAs, GMAP performed comparably with GeneSeqer.

As a standalone program, GMAP allows users to map and align a single cDNA interactively against a large genome in about a second, switch arbitrarily among different genomes, run on computers with as little as 128 MB of RAM, perform high-throughput batch processing of cDNAs, generate accurate gene models, locate splice sites accurately without probabilistic splice site models, detect statistically significant microexons and incorporate them into the alignment, and handle mapping and alignment tasks on genomes having alternate assemblies, linkage groups or strains.
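To make the idea of oligomer-based genomic mapping more concrete, here is a toy sketch in Python. It is not GMAP's actual procedure, which this summary does not describe in detail. It only illustrates the general technique of indexing genomic oligomers, sampling a few oligomers from the query, and voting on the most consistent genomic offset. The function names `build_index` and `map_query`, the oligomer length `k`, the sampling interval `step`, and the example sequences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def map_query(query, index, k, step):
    """Sample a k-mer from the query every `step` bases and vote on the offset.

    Each index hit votes for (genome position - query position); the offset
    with the most votes is returned, or None if no sampled k-mer is found.
    """
    votes = Counter()
    for qpos in range(0, len(query) - k + 1, step):
        for gpos in index.get(query[qpos:qpos + k], ()):
            votes[gpos - qpos] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Hypothetical toy data: the query is an exact copy of genome[10:30].
genome = "TTGACCGTAGGCTAACGTTAGCCATGGATCCGATTACAGGCTTAACG"
query = genome[10:30]
index = build_index(genome, k=6)
offset = map_query(query, index, k=6, step=4)
print(offset)
```

Sampling only every few bases keeps the number of index lookups small, which is the intuition behind a sampling strategy. A real cDNA spans introns, so a practical mapper would have to combine hits from several offsets rather than pick a single one.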
These methods allow GMAP to handle certain types of alignment problems that pose challenges for existing programs. In particular, GMAP has an explicit procedure for detecting microexons and incorporating them into the alignment. The procedure is probabilistic: it considers only candidate microexons that satisfy a calculated lower bound on length. GMAP applies the procedure very conservatively, requiring a high-quality sequence, an adjacent canonical intron, and an exact match of the remaining subsequence to the genome.

In Experiment 1, GMAP was tested on full-length human mRNAs. The Ensembl data set contains annotated exon boundaries, which were used as a gold standard. The data set contained a total of 8634 exons, some of which were extremely short. GMAP identified splice sites accurately in over 99.3% of the sequences, one-tenth the error rate of existing programs, and also performed well in identifying microexons and incorporating them into the alignment.
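The lower bound on microexon length can be motivated with a short calculation. The sketch below is an illustrative assumption, not the formula from the paper, which this summary does not give. It assumes uniformly random bases, so a specific sequence of length k is expected to occur about L / 4^k times by chance in a search region of L bases, and it returns the smallest k for which that expectation falls below a chosen threshold. The function name `min_microexon_length` and the threshold `alpha` are hypothetical.

```python
def min_microexon_length(region_length, alpha=0.01):
    """Smallest k such that region_length * 4**-k <= alpha.

    Under a uniform random-base model (an assumption), region_length * 4**-k
    approximates the expected number of chance matches of a k-base sequence.
    """
    k = 1
    while region_length * 4.0 ** -k > alpha:
        k += 1
    return k

for region in (1_000, 100_000):
    print(region, min_microexon_length(region))
```

Under this model, a longer search region demands a longer microexon before a match can be considered significant, which is consistent with the conservative behavior described above.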