Understanding TopHat: discovering splice junctions with RNA-Seq

TopHat is a read-mapping algorithm designed to align RNA-Seq reads to a reference genome without relying on known splice sites. It identifies novel splice junctions by first mapping reads to the genome, assembling the mapped reads into putative exons, and then checking candidate introns between those exons against the reads that did not map initially. TopHat maps reads to splice sites at a rate of approximately 2.2 million reads per CPU hour, making it much faster than previous systems. It uses an efficient 2-bit-per-base encoding and a data layout that makes effective use of the cache on modern processors.

TopHat first maps non-junction reads using Bowtie, an ultra-fast short-read mapping program. Bowtie indexes the reference genome using a technique borrowed from data compression, the Burrows–Wheeler transform. This memory-efficient data structure allows Bowtie to scan reads against a mammalian genome using around 2 GB of memory. TopHat then assembles the mapped reads using the assembly module in Maq. It extracts the sequences of the resulting islands of contiguous sequence from the sparse consensus and infers them to be putative exons. To generate the island sequences, TopHat invokes the Maq assemble subcommand (with the -s flag), which produces a compact consensus file containing the called bases and the corresponding reference bases. Because the consensus may include incorrect base calls caused by sequencing errors in low-coverage regions, TopHat treats each island as a 'pseudoconsensus': at any low-coverage or low-quality position, it uses the reference genome to call the base.

TopHat then enumerates all canonical donor and acceptor sites within the island sequences (as well as their reverse complements). Next, it considers all pairings of these sites that could form canonical (GT–AG) introns between neighboring (but not necessarily adjacent) islands. Each possible intron is checked against the initially unmapped (IUM) reads for reads that span the splice junction.
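
The site enumeration and pairing step can be illustrated with a minimal Python sketch. It is not TopHat's implementation: it assumes each island is given as a (genomic start, sequence) pair that already includes enough flanking sequence to contain the GT or AG dinucleotide, it handles the forward strand only, and the names `donor_sites`, `acceptor_sites`, `candidate_introns` and the `max_island_gap` parameter are hypothetical. The length bounds are the default minimum and maximum intron lengths quoted in the text (70 bp and 20,000 bp).

```python
from typing import List, Tuple

# Hypothetical, simplified island representation: (genomic start, sequence).
Island = Tuple[int, str]


def donor_sites(start: int, seq: str) -> List[int]:
    """Genomic positions of the G in every 'GT' dinucleotide of an island."""
    return [start + i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]


def acceptor_sites(start: int, seq: str) -> List[int]:
    """Genomic positions of the A in every 'AG' dinucleotide of an island."""
    return [start + i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]


def candidate_introns(
    islands: List[Island],
    min_len: int = 70,         # default minimum intron length from the text
    max_len: int = 20000,      # default maximum intron length from the text
    max_island_gap: int = 2,   # assumption: how many downstream islands to pair with
) -> List[Tuple[int, int]]:
    """Pair GT donors with AG acceptors in neighboring downstream islands.

    Returns (intron_start, intron_end) pairs with inclusive coordinates,
    where intron_start is the G of GT and intron_end is the G of AG.
    """
    islands = sorted(islands)
    introns = []
    for i, (start_i, seq_i) in enumerate(islands):
        donors = donor_sites(start_i, seq_i)
        # Neighboring, but not necessarily adjacent, downstream islands.
        for start_j, seq_j in islands[i + 1:i + 1 + max_island_gap]:
            for a in acceptor_sites(start_j, seq_j):
                for d in donors:
                    length = a + 2 - d  # from the G of GT through the G of AG
                    if min_len < length < max_len:
                        introns.append((d, a + 1))
    return introns


if __name__ == "__main__":
    toy_islands = [(100, "ACGGT"), (300, "AGCCA")]
    print(candidate_introns(toy_islands))  # [(103, 301)]
```

In this toy input the only GT sits at position 103 and the only AG in the downstream island at position 300, giving one candidate intron of 199 bp. A fuller version would also scan the reverse complements, as the text describes.
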
By default, TopHat only examines potential introns longer than 70 bp and shorter than 20,000 bp, a range that covers the vast majority of known eukaryotic introns; the user can adjust both the minimum and maximum intron length. TopHat searches the IUM reads for reads that span junctions using a seed-and-extend strategy. The pipeline indexes the IUM reads using a simple lookup table to amortize the cost of searching for a spliced alignment over many reads. As illustrated in Figure 3, TopHat finds any reads that span splice junctions by at least k bases on each side (where k = 5 bp by default), so the table is keyed by 2k-mers, and each 2k-mer is associated with the reads that contain it. For each read, the table contains (s - 2k + 1) entries corresponding to possible positions where a splice may fall within a read, where s is the length of the high-quality region on the 5' end (default
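
The 2k-mer lookup table can be sketched in Python as follows. This is a rough illustration of the seed step only, not TopHat's code: the names `build_seed_table` and `junction_seed` are hypothetical, reads are plain strings, matching is exact, and the extension step that compares the rest of the read to the flanking exon sequence is left out. The parameter `s` has no default here because the source does not give one at this point; passing `None` indexes the whole read.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_seed_table(
    reads: List[str], k: int = 5, s: Optional[int] = None
) -> Dict[str, List[Tuple[int, int]]]:
    """Index reads by every 2k-mer in their first s bases.

    Each read contributes (s - 2k + 1) entries, one per offset at which
    a 2k-mer starts. An entry (read_id, pos) means a splice could fall
    between read positions pos + k - 1 and pos + k.
    """
    table: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for read_id, read in enumerate(reads):
        region = read if s is None else read[:s]
        # range() is empty when the region is shorter than 2k.
        for pos in range(len(region) - 2 * k + 1):
            table[region[pos:pos + 2 * k]].append((read_id, pos))
    return table


def junction_seed(donor_flank: str, acceptor_flank: str, k: int = 5) -> str:
    """Build the 2k-mer that a read spanning the junction must contain:
    the last k exon bases before the donor site followed by the first
    k exon bases after the acceptor site."""
    return donor_flank[-k:] + acceptor_flank[:k]


if __name__ == "__main__":
    reads = ["AAAAACCCCCGG", "TTTTTTTTTTTT"]
    table = build_seed_table(reads, k=5)
    seed = junction_seed("GGGAAAAA", "CCCCCTTT", k=5)
    print(seed, table.get(seed, []))  # AAAAACCCCC [(0, 0)]
```

Each 12-base read above yields 12 - 2*5 + 1 = 3 entries. Building the table once and querying it with one seed per candidate junction is what spreads the search cost over many reads.
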

TopHat: discovering splice junctions with RNA-Seq

March 16, 2009 | Cole Trapnell, Lior Pachter and Steven L. Salzberg