Detecting Novel Associations in Large Datasets

Science. 2011 December 16; 334(6062): 1518–1524 | David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti
The paper introduces the maximal information coefficient (MIC), a measure of dependence for two-variable relationships in large datasets. MIC is designed to capture a wide range of associations, both functional and non-functional, and for functional relationships it gives a score roughly equal to the coefficient of determination (R²) of the data relative to the regression function. The authors show that MIC is general and equitable: it captures a broad class of interesting associations and gives similar scores to equally noisy relationships of different types. They also show that MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics, which can be used to identify and characterize relationships by properties such as non-linearity and monotonicity.
The paper applies MIC and MINE to datasets in global health, gene expression, baseball, and the human gut microbiota, identifying both known and novel relationships. The authors conclude that MINE is a useful tool for identifying and characterizing structure in complex datasets across diverse fields.
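
To make the idea behind MIC concrete, here is a minimal Python sketch of a grid-based score in the same spirit. It is not the authors' algorithm. The paper's method searches for the best placement of grid lines, whereas this version only tries equal-frequency grids, so it can underestimate the true MIC. The function names (`equal_frequency_bins`, `mutual_information`, `approx_mic`) are hypothetical. The grid-size cap of n^0.6 and the normalization by log min(x, y) follow the paper's definition.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins with roughly equal counts.

    Tied values are kept in the same bin. This is a simplification: the
    paper optimizes bin boundaries instead of fixing them by rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        prev = order[rank - 1] if rank > 0 else None
        if prev is not None and values[i] == values[prev]:
            bins[i] = bins[prev]          # keep ties together
        else:
            bins[i] = rank * k // n
    return bins


def mutual_information(bx, by):
    """Mutual information (in nats) between two sequences of bin labels."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def approx_mic(xs, ys, exponent=0.6):
    """Largest normalized mutual information over small equal-frequency grids.

    Grids with nx columns and ny rows are tried while nx * ny < n ** exponent.
    Each grid's mutual information is divided by log(min(nx, ny)) so that
    scores from different grid shapes are comparable and lie in [0, 1].
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    max_cells = len(xs) ** exponent
    best = 0.0
    nx = 2
    while nx * 2 < max_cells:
        bx = equal_frequency_bins(xs, nx)
        ny = 2
        while nx * ny < max_cells:
            by = equal_frequency_bins(ys, ny)
            score = mutual_information(bx, by) / math.log(min(nx, ny))
            best = max(best, score)
            ny += 1
        nx += 1
    return best
```

Calling `approx_mic(xs, ys)` on two equal-length lists of numbers returns a score between 0 and 1. A noiseless parabola such as `ys = [(x - 50) ** 2 for x in xs]` still scores high even though its Pearson correlation is zero, which is the kind of non-linear association the summary describes.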