Detecting Novel Associations in Large Data Sets

The paper introduces the maximal information coefficient (MIC), a measure of dependence for detecting associations between pairs of variables in large datasets. MIC is based on mutual information and is designed to capture a wide range of relationships, both functional and non-functional. It belongs to a broader class of statistics called maximal information-based nonparametric exploration (MINE), which are used to identify and classify relationships. MIC is characterized by two key properties: generality and equitability. Generality means that, given a sufficient sample size, it captures a wide range of associations rather than only specific functional forms. Equitability means that it gives similar scores to equally noisy relationships of different types. The authors support these properties with mathematical proofs and simulations. In their comparisons with other methods, including mutual information estimators, maximal correlation, and distance correlation, MIC was more general and more equitable, and it performed well across a variety of relationship types, including non-linear and complex ones. The other MINE statistics characterize relationships by properties such as non-linearity, monotonicity, complexity, and closeness to being a function. The authors applied MIC and MINE to datasets on global health indicators, yeast gene expression, baseball, and the human gut microbiome, identifying both known and novel associations. The study highlights the importance of tools like MIC in analyzing large, complex datasets across diverse fields.
MIC provides a versatile and effective method for identifying and characterizing relationships, making it a valuable tool for data exploration.
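
To make the idea of a mutual-information-based score more concrete, here is a minimal Python sketch. It is not the authors' algorithm. It assumes, as in the paper's definition, that MIC is the largest grid mutual information divided by the log of the smaller grid dimension, taken over grids whose cell count is bounded by n^0.6. The simplification, which is an assumption of this sketch, is that only equal-frequency grids are tried instead of searching over grid placements, so the score tends to underestimate the real statistic. The names `grid_mutual_information` and `mic_sketch` are hypothetical.

```python
import numpy as np


def grid_mutual_information(x_ranks, y_ranks, nx, ny):
    """Mutual information (in bits) on an nx-by-ny equal-frequency grid."""
    n = len(x_ranks)
    # Map each rank (0..n-1) to a column/row index so bins hold ~n/nx points.
    col = x_ranks * nx // n
    row = y_ranks * ny // n
    counts = np.zeros((nx, ny))
    np.add.at(counts, (col, row), 1)  # accumulate joint counts per grid cell
    p_xy = counts / n
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal over columns
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal over rows
    mask = p_xy > 0  # empty cells contribute nothing to the sum
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])))


def mic_sketch(x, y, alpha=0.6):
    """Simplified MIC-like score in [0, 1] (assumption: equal-frequency grids only)."""
    x_ranks = np.argsort(np.argsort(x))
    y_ranks = np.argsort(np.argsort(y))
    max_cells = len(x_ranks) ** alpha  # grid-size budget, B(n) = n^alpha
    best = 0.0
    for nx in range(2, int(max_cells // 2) + 1):
        for ny in range(2, int(max_cells // nx) + 1):
            mi = grid_mutual_information(x_ranks, y_ranks, nx, ny)
            # Normalise by the maximum achievable MI on this grid shape.
            best = max(best, mi / np.log2(min(nx, ny)))
    return float(best)


# Illustrative synthetic data only; not taken from the paper.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
for name, y in [("linear", 2 * x + 1),
                ("parabola", x ** 2),
                ("independent", rng.uniform(-1, 1, 500))]:
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    print(f"{name:12s} mic_sketch={mic_sketch(x, y):.2f}  pearson_r2={r2:.2f}")
```

On the noiseless parabola, Pearson's r² stays near zero while the grid-based score is high. That gap is the intuition behind the paper's use of MIC − ρ² as a measure of non-linearity. For real analyses, the authors' published implementation should be used instead of this sketch.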

Detecting Novel Associations in Large Data Sets

2011 December 16 | David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti