20 Jun 2013 | Simon Anders, Davis J. McCarthy, Yunshun Chen, Michal Okoniewski, Gordon K. Smyth, Wolfgang Huber, Mark D. Robinson
This article presents a comprehensive workflow for count-based differential expression analysis of RNA sequencing (RNA-seq) data using R and Bioconductor. The protocol focuses on two widely used tools, DESeq and edgeR, which implement general differential expression analysis based on the negative binomial (NB) model. The workflow includes steps for read alignment, quality control, feature counting, and statistical analysis. The process is designed to be efficient: for small experiments, hands-on time is typically under one hour and computation time under one day on a standard desktop PC.
The protocol covers the analysis of RNA-seq data to identify genes that change in abundance between different conditions, such as tissues or experimental treatments. It also addresses the need to account for systematic factors that may affect data collection. The statistical methods used are integral to the differential expression discovery task and operate on a feature count table. Before statistical modeling, further quality checks are encouraged to ensure the biological question can be addressed.
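To make the idea of a feature count table concrete, the following is a minimal sketch in plain Python rather than the R code used by the protocol itself. The gene IDs, sample names, and counts are invented for illustration, and the two checks shown (per-sample library size and removal of genes with no reads in any sample) are assumed examples of simple quality checks, not a list taken from the protocol.

```python
# Hypothetical feature count table: one row per gene, one column per sample.
# All gene IDs, sample names and counts below are invented for illustration.
samples = ["untreated1", "untreated2", "treated1", "treated2"]
counts = {
    "geneA": [120, 98, 310, 287],
    "geneB": [0, 0, 0, 0],
    "geneC": [15, 22, 9, 14],
}


def library_sizes(counts, samples):
    """Return the total number of counted reads per sample (column sums)."""
    return {
        sample: sum(row[i] for row in counts.values())
        for i, sample in enumerate(samples)
    }


def drop_unexpressed(counts):
    """Remove genes that have zero counts in every sample."""
    return {gene: row for gene, row in counts.items() if any(row)}


sizes = library_sizes(counts, samples)
filtered = drop_unexpressed(counts)

print(sizes)
print(sorted(filtered))
```

Very unequal library sizes or a large share of all-zero genes would be a reason to look more closely at the samples before fitting any model.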
The protocol is modular and extensible, allowing for the integration of alternative tools and strategies. It includes a detailed description of the experimental design, including considerations for replication and confounding factors. The workflow is illustrated with an example dataset from Drosophila melanogaster, demonstrating the analysis of RNA-seq samples treated with siRNA targeting the splicing factor pasilla.
The protocol emphasizes the importance of using appropriate statistical models, such as the negative binomial model, to account for biological variability and technical variation. It also discusses the advantages and disadvantages of different tools for differential expression analysis, noting that edgeR and DESeq remain among the top performers despite the availability of many new tools.
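The role of the negative binomial model can be illustrated with its mean-variance relationship. In the parameterization commonly used for RNA-seq counts, a gene with mean count mu and dispersion alpha has variance mu + alpha * mu^2, whereas a Poisson model would give variance mu. The sketch below is plain Python with hypothetical function names and invented numbers; it is not code from DESeq or edgeR.

```python
import math


def poisson_variance(mean):
    """Variance of a count under a Poisson model (technical variation only)."""
    return mean


def nb_variance(mean, dispersion):
    """Variance of a count under a negative binomial model.

    The extra term dispersion * mean**2 represents variability between
    biological replicates on top of the Poisson counting noise.
    """
    return mean + dispersion * mean ** 2


def biological_cv(dispersion):
    """Biological coefficient of variation: square root of the dispersion."""
    return math.sqrt(dispersion)


# Invented example values: a gene with mean count 100 and dispersion 0.25.
mean, dispersion = 100, 0.25
print(poisson_variance(mean))         # 100
print(nb_variance(mean, dispersion))  # 2600.0
print(biological_cv(dispersion))      # 0.5
```

For highly expressed genes the quadratic term dominates, which is why a Poisson model tends to understate the variability between replicates.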
The workflow includes steps for mapping reads to a reference genome, organizing and sorting BAM files, and counting reads using htseq-count. The protocol also provides guidance on the use of R and Bioconductor for data analysis, including the use of R functions for quality control, statistical modeling, and visualization. The article concludes with a detailed description of the example data and the steps required to analyze it, including the use of the ShortRead package for quality control and the edgeR and DESeq packages for differential expression analysis.
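As an illustration of how per-sample htseq-count output could be assembled into a single count table, here is a minimal Python sketch. It assumes the usual htseq-count output layout of one tab-separated `feature<TAB>count` line per feature, followed by special counter lines such as `no_feature` and `ambiguous` (prefixed with `__` in newer versions). The function names, sample names, and counts are hypothetical, and the protocol itself performs this step in R.

```python
import io

# Special counters that htseq-count appends after the per-feature lines.
SPECIAL_COUNTERS = {
    "no_feature",
    "ambiguous",
    "too_low_aQual",
    "not_aligned",
    "alignment_not_unique",
}


def read_htseq_counts(handle):
    """Parse htseq-count output into {feature: count}, skipping special counters."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        feature, value = line.split("\t")
        # Newer htseq-count versions prefix the special counters with "__".
        name = feature[2:] if feature.startswith("__") else feature
        if name in SPECIAL_COUNTERS:
            continue
        counts[feature] = int(value)
    return counts


def merge_samples(per_sample):
    """Combine {sample: {feature: count}} into (samples, {feature: [counts]})."""
    samples = list(per_sample)
    features = sorted(set().union(*(c.keys() for c in per_sample.values())))
    table = {f: [per_sample[s].get(f, 0) for s in samples] for f in features}
    return samples, table


# Invented in-memory stand-ins for two htseq-count output files.
untreated = io.StringIO("geneA\t120\ngeneB\t0\n__no_feature\t431\n__ambiguous\t12\n")
treated = io.StringIO("geneA\t310\ngeneB\t3\nno_feature\t502\nambiguous\t9\n")

samples, table = merge_samples(
    {"untreated1": read_htseq_counts(untreated), "treated1": read_htseq_counts(treated)}
)
print(samples)
print(table)
```

Reading from real files would only require replacing the `io.StringIO` objects with opened file handles.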