December 18, 2009 | Bo Li¹, Victor Ruotti², Ron M. Stewart², James A. Thomson² and Colin N. Dewey¹,³,*
RNA-Seq is a powerful technology for accurately measuring gene expression levels. However, RNA-Seq reads often map to multiple genes (gene multireads) or to multiple isoforms of the same gene (isoform multireads), which complicates expression analysis. Previous methods either discard such reads or allocate them heuristically.

This study presents a generative statistical model and an accompanying inference method that handle read mapping uncertainty explicitly, improving accuracy. The model estimates the expression of a gene as the sum of the expression levels of its isoforms and accounts for non-uniform read distributions. The method is implemented in C++ and available online.

The method outperforms previous approaches, especially in repetitive genomes such as maize, and its handling of gene multireads and isoform multireads yields more accurate gene-level expression estimates. Simulations show that reads of 20–25 bases are optimal for gene-level expression estimation in mouse and maize, which suggests that shorter reads at higher throughput are more effective for gene expression estimation than longer reads. The method performs well across different species and tissues, although the optimal read length varies with the organism.

The results indicate that RNA-Seq can provide more accurate gene expression estimates than microarrays, especially when read mapping uncertainty is taken into account. The study underscores the importance of modeling read mapping uncertainty and non-uniform read distributions in RNA-Seq analysis, and it calls for further research on both in order to improve the accuracy of expression estimation.
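To make the idea of allocating multireads with a statistical model (rather than discarding them) more concrete, the following is a minimal sketch in Python. It is not the authors' C++ implementation. It assumes a deliberately simplified model in which a read from an isoform is equally likely to start anywhere along it, and it uses expectation-maximization (EM) as one standard way to fit such a model. The function names (`em_read_fractions`, `gene_levels`), the toy reads, and the isoform lengths are all hypothetical, and non-uniform read distributions are not modeled here.

```python
from collections import defaultdict


def em_read_fractions(read_alignments, isoform_lengths, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    read_alignments: list of lists; each inner list holds the isoforms a
        read aligns to (assumed non-empty). A multiread has several entries.
    isoform_lengths: dict mapping isoform name -> length in bases.
    Simplifying assumption: P(read | isoform) is proportional to 1 / length.
    """
    isoforms = sorted(isoform_lengths)
    # Start from a uniform guess over all isoforms.
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}

    for _ in range(n_iter):
        expected = dict.fromkeys(isoforms, 0.0)
        # E-step: split each read among its candidate isoforms in proportion
        # to the current abundance estimate divided by isoform length.
        for candidates in read_alignments:
            weights = {iso: theta[iso] / isoform_lengths[iso] for iso in candidates}
            total = sum(weights.values())
            for iso, w in weights.items():
                expected[iso] += w / total
        # M-step: re-estimate abundances from the expected read counts.
        n_reads = sum(expected.values())
        theta = {iso: expected[iso] / n_reads for iso in isoforms}
    return theta


def gene_levels(theta, isoform_to_gene):
    """Gene-level estimate as the sum of the estimates of its isoforms."""
    genes = defaultdict(float)
    for iso, value in theta.items():
        genes[isoform_to_gene[iso]] += value
    return dict(genes)


if __name__ == "__main__":
    # Toy data: 6 reads unique to A1, 2 unique to B1, 2 multireads (A1 or B1).
    reads = [["A1"]] * 6 + [["B1"]] * 2 + [["A1", "B1"]] * 2
    lengths = {"A1": 1000, "A2": 1000, "B1": 1000}
    theta = em_read_fractions(reads, lengths)
    print(theta)
    print(gene_levels(theta, {"A1": "A", "A2": "A", "B1": "B"}))
```

In this toy case the isoform lengths are equal, so the two multireads end up shared between A1 and B1 in roughly the same 3:1 ratio as the uniquely mapped reads, and the estimates approach 0.75 for gene A and 0.25 for gene B. Discarding the multireads would instead base the estimate on only 8 of the 10 reads.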