Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms

2010 May | Cole Trapnell, Brian A. Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J. van Baren, Steven L. Salzberg, Barbara J. Wold, Lior Pachter
Cufflinks, a new algorithm and accompanying software, was developed to assemble transcripts and estimate their abundances from RNA-Seq data. It was tested on RNA-Seq data from a mouse myoblast cell line, revealing thousands of new transcripts and switching among isoforms. Of the newly identified transcripts, 62% were supported by independent expression data or by homologous genes in other species. Analysis of transcript expression over time revealed complete switches in the dominant transcription start site (TSS) or splice isoform in 330 genes, along with more subtle shifts in a further 1,304 genes. These dynamics suggest substantial regulatory flexibility and complexity in this well-studied model of muscle development. In total, 7,770 genes and 10,480 isoforms underwent significant abundance changes between some successive pairs of time points (FDR < 5%). Many genes displayed substantial transcript-level dynamics that were not reflected in their overall expression patterns. For example, Myc, a proto-oncogene known to be transcriptionally and post-transcriptionally regulated during myogenesis, showed complex expression patterns among its isoforms. Many genes also featured dynamics involving several isoforms whose behavior was too complex to be deemed "switching." To describe these, the study classified each transcript's expression dynamics into one of four "trajectories," according to whether its expression curve was flat, increasing, decreasing, or mixed. Under this classification, 1,634 genes had multiple isoforms with different trajectories over the time course. The study hypothesized that differential promoter preference and differential splicing were responsible for the divergent patterns. Novel transcription start sites and isoforms were validated using various methods, including ChIP-Seq and endpoint RT-PCR.
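The four-way trajectory classification described above can be illustrated with a minimal sketch. This is not the study's actual procedure: the fold-change threshold `min_fold`, the `pseudocount`, the function names, and the toy abundances are all assumptions made here for illustration. A transcript is labeled by whether its abundance rises, falls, does both, or does neither between successive time points, and a gene is flagged when its isoforms receive different labels.

```python
from collections import defaultdict


def classify_trajectory(values, min_fold=1.5, pseudocount=1.0):
    """Label a time course as flat, increasing, decreasing, or mixed.

    min_fold and pseudocount are illustrative assumptions, not values
    taken from the study.
    """
    ups = downs = 0
    for prev, curr in zip(values, values[1:]):
        # Pseudocount keeps the ratio defined when an abundance is zero.
        ratio = (curr + pseudocount) / (prev + pseudocount)
        if ratio >= min_fold:
            ups += 1
        elif ratio <= 1.0 / min_fold:
            downs += 1
    if ups and downs:
        return "mixed"
    if ups:
        return "increasing"
    if downs:
        return "decreasing"
    return "flat"


def genes_with_divergent_isoforms(isoform_expression):
    """Return genes whose isoforms do not all share one trajectory label.

    isoform_expression maps (gene, isoform) to a list of abundances,
    one per time point.
    """
    labels_by_gene = defaultdict(set)
    for (gene, _isoform), values in isoform_expression.items():
        labels_by_gene[gene].add(classify_trajectory(values))
    return sorted(g for g, labels in labels_by_gene.items() if len(labels) > 1)


# Hypothetical abundances for two genes with two isoforms each.
toy = {
    ("geneA", "iso1"): [10, 20, 45, 100],
    ("geneA", "iso2"): [80, 40, 15, 5],
    ("geneB", "iso1"): [30, 32, 29, 31],
    ("geneB", "iso2"): [12, 11, 13, 12],
}
divergent = genes_with_divergent_isoforms(toy)
```

In this toy input, only geneA is flagged, because one of its isoforms rises while the other falls; both geneB isoforms stay within the assumed threshold and are labeled flat.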
Including novel isoforms of known genes during abundance estimation had a dramatic impact on the estimates for known isoforms in many genes. The study also demonstrated that Cufflinks estimates transcript abundances robustly across different expression levels and sequencing depths. The software was found to be effective at assembling transcripts and estimating their abundances, and it is applicable to a broad range of RNA-Seq studies. The study concluded that the impact of promoter switching on mRNA output is significant, and that many genes also show evidence of post-transcriptionally induced expression changes, supporting a role for dynamic splicing regulation in myogenesis.
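The distinction the summary draws between a complete switch in the dominant isoform and a more subtle shift can be made concrete with a small sketch. The definitions used here are assumptions for illustration only, not the statistical tests applied in the study: a "complete switch" means the most abundant isoform differs between at least two time points, and a "subtle shift" means the dominant isoform stays the same while some isoform's share of the gene's total abundance changes by at least a hypothetical `min_shift`.

```python
def isoform_fractions(isoforms):
    """Convert per-isoform abundances into per-time-point fractions.

    isoforms maps an isoform name to a list of abundances for one gene.
    """
    names = sorted(isoforms)
    n_points = len(isoforms[names[0]])
    fractions = []
    for t in range(n_points):
        total = sum(isoforms[n][t] for n in names)
        fractions.append(
            {n: (isoforms[n][t] / total if total > 0 else 0.0) for n in names}
        )
    return fractions


def describe_gene(isoforms, min_shift=0.1):
    """Return 'complete switch', 'subtle shift', or 'stable'.

    min_shift is an illustrative threshold, not a value from the study.
    """
    fractions = isoform_fractions(isoforms)
    # Dominant isoform at each time point (ties go to the first name).
    dominant = [max(sorted(f), key=f.get) for f in fractions]
    if len(set(dominant)) > 1:
        return "complete switch"
    largest_change = max(
        max(f[n] for f in fractions) - min(f[n] for f in fractions)
        for n in fractions[0]
    )
    return "subtle shift" if largest_change >= min_shift else "stable"


# Hypothetical genes, each with two isoforms over three time points.
switching = describe_gene({"iso1": [90, 50, 10], "iso2": [10, 50, 90]})
shifting = describe_gene({"iso1": [90, 80, 70], "iso2": [10, 20, 30]})
stable = describe_gene({"iso1": [90, 180, 90], "iso2": [10, 20, 10]})
```

Working with fractions rather than raw abundances means a gene whose isoforms all rise or fall together is reported as stable, which mirrors the summary's point that transcript-level dynamics can differ from overall gene expression.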