2008 | Kay Prüfer, Udo Stenzel, Michael Dannemann, Richard E. Green, Michael Lachmann and Janet Kelso
PatMaN is a tool for rapidly aligning short nucleotide sequences to large databases, allowing for a predefined number of gaps and mismatches. It uses a non-deterministic automata matching algorithm on a keyword tree of the search strings. The program is command-line driven and can handle both queries with and without ambiguity codes. Search time is short for perfect matches, but retrieval time increases exponentially with the number of allowed edits.
PatMaN is implemented in C++ and is distributed under the GNU General Public License. It is available at http://bioinf.eva.mpg.de/patman. The program reads sequences in FASTA format and reports all hits within the given edit-distance cutoff. It was demonstrated by aligning Affymetrix HGU95-A microarray probes to the chimpanzee genome.
The algorithm constructs a keyword tree of all query sequences. If the ambiguity flag is set, all possible bases at ambiguous positions are added to the tree. The search for occurrences on forward and reverse strands is facilitated by adding the reverse complement of all query sequences to the same tree. The resulting graph consists of internal nodes with outgoing edges for all four possible bases and for the ambiguity base 'N'.
The time efficiency of the search algorithm is linear in the size of the target database, but depends heavily on the maximum edit distance as well as the average length and number of query sequences. When ambiguity codes are not interpreted and the query sequences contain no 'N' character, the keyword tree can be constructed in O(L) time and requires O(L) space, where L represents the total length of all query sequences. When ambiguity is enabled, both time and space requirements increase exponentially in the number of ambiguity codes used in the patterns.
PatMaN is suitable for searching short sequences with a limited number of edit operations. It was used to match 201,807 Affymetrix HGU95-A microarray 25mer probes to the chimpanzee genome, finding 15.9 million hits. The program's performance was tested on different hardware configurations, showing an exponential increase in runtime with the maximum allowed edit distance. PatMaN is a new tool for mapping short sequences to large nucleotide databases, and is expected to have further applications in the near future, particularly in the analysis of gene expression using next generation resequencing technology.PatMaN is a tool for rapidly aligning short nucleotide sequences to large databases, allowing for a predefined number of gaps and mismatches. It uses a non-deterministic automata matching algorithm on a keyword tree of the search strings. The program is command-line driven and can handle both queries with and without ambiguity codes. Search time is short for perfect matches, but retrieval time increases exponentially with the number of allowed edits.
PatMaN is implemented in C++ and is distributed under the GNU General Public License. It is available at http://bioinf.eva.mpg.de/patman. The program reads sequences in FASTA format and reports all hits within the given edit-distance cutoff. It was demonstrated by aligning Affymetrix HGU95-A microarray probes to the chimpanzee genome.
The algorithm constructs a keyword tree of all query sequences. If the ambiguity flag is set, all possible bases at ambiguous positions are added to the tree. The search for occurrences on forward and reverse strands is facilitated by adding the reverse complement of all query sequences to the same tree. The resulting graph consists of internal nodes with outgoing edges for all four possible bases and for the ambiguity base 'N'.
The time efficiency of the search algorithm is linear in the size of the target database, but depends heavily on the maximum edit distance as well as the average length and number of query sequences. When ambiguity codes are not interpreted and the query sequences contain no 'N' character, the keyword tree can be constructed in O(L) time and requires O(L) space, where L represents the total length of all query sequences. When ambiguity is enabled, both time and space requirements increase exponentially in the number of ambiguity codes used in the patterns.
PatMaN is suitable for searching short sequences with a limited number of edit operations. It was used to match 201,807 Affymetrix HGU95-A microarray 25mer probes to the chimpanzee genome, finding 15.9 million hits. The program's performance was tested on different hardware configurations, showing an exponential increase in runtime with the maximum allowed edit distance. PatMaN is a new tool for mapping short sequences to large nucleotide databases, and is expected to have further applications in the near future, particularly in the analysis of gene expression using next generation resequencing technology.