Differential expression analysis for sequence count data

2010 | Simon Anders, Wolfgang Huber
This paper introduces DESeq, an R/Bioconductor package for differential expression analysis of sequence count data. The method is based on the negative binomial distribution, with mean and variance linked by local regression. This addresses overdispersion, the common situation in count data where the variance exceeds the mean, which the Poisson distribution cannot represent because it forces the variance to equal the mean.

DESeq estimates mean and variance parameters for each gene, which allows more accurate statistical testing of differential expression. The model assumes that the mean count is proportional to the expected abundance of the gene under a given condition, scaled by a per-sample size factor that accounts for sequencing depth, and that the variance is a function of the mean and the size factor. The variance function is estimated by local regression rather than a fixed parametric form, so the mean-variance relationship can follow the data.

DESeq is tested on several data sets, including RNA-Seq, ChIP-Seq, and Tag-Seq data. It is reported to control type-I error better than alternative methods such as edgeR while still detecting differentially expressed genes. The method is also applied to data with few replicates, where pooling information across genes of similar expression strength keeps the variance estimates usable; the paper additionally describes a variance-stabilizing transformation for downstream uses such as sample clustering. The paper stresses that a model allowing for overdispersion is needed because the Poisson distribution can underestimate biological variability, particularly for highly expressed genes. DESeq is computationally efficient and provides diagnostic tools for assessing the reliability of the results.
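
The size-factor and variance assumptions above can be made concrete with a minimal Python sketch. It is an illustration only, not the DESeq package's R interface: the function names `size_factors` and `nb_variance` are hypothetical, the median-of-ratios rule is the size-factor estimator described in the DESeq paper rather than in this summary, and the raw variance is passed in as a given number instead of being fitted by local regression.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios rule.

    counts: list of genes, each a list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue
        # Log of the gene's geometric mean across samples.
        log_geo_mean = sum(math.log(k) for k in row) / n_samples
        for j, k in enumerate(row):
            log_ratios[j].append(math.log(k) - log_geo_mean)
    # The median over genes is robust to a few strongly changed genes.
    return [math.exp(median(r)) for r in log_ratios]


def nb_variance(q, s, raw_var):
    """Variance of a count whose mean is mu = s * q.

    q: expected abundance of the gene under the condition
    s: size factor of the sample
    raw_var: condition-level raw variance (assumed given here)
    The first term is the Poisson-like shot noise, the second is the
    extra biological variance scaled by the squared size factor.
    """
    mu = s * q
    return mu + s ** 2 * raw_var
```

For example, `size_factors([[10, 20], [100, 200], [5, 10], [0, 3]])` gives two factors whose ratio is 2, matching a second sample sequenced twice as deeply, and `nb_variance(50.0, 2.0, 100.0)` returns 500.0, five times the Poisson variance of 100 for the same mean.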