APRIL 2018 | Danilo Bzdok, Naomi Altman & Martin Krzywinski
Statistics and machine learning (ML) have distinct goals in biological research: statistics focuses on inference, building models to understand how the data were generated and to test hypotheses, while ML focuses on prediction, identifying patterns in complex data. Both are valuable, because researchers need to understand biological processes as well as predict outcomes. Statistical methods rely on probability models to quantify confidence in findings, while ML methods use general-purpose algorithms to find patterns with minimal assumptions about the data-generating process. ML is particularly effective with 'wide data' (many variables) and can handle complex, nonlinear interactions even when the data were not gathered under a controlled design. However, ML lacks an explicit model, which makes it harder to relate its findings to existing biological knowledge.
As the number of variables increases, explicit statistical models become harder to specify and fit, while ML tends to become the more practical choice. A simulation of gene expression data across two phenotypes illustrates the differences between statistical inference and ML. Inference identifies genes whose expression differs significantly between the phenotypes, while ML, using random forests, ranks genes by their importance for classifying the phenotype. Both methods identify similar numbers of dysregulated genes, but inference is more sensitive to small effect sizes.
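The comparison described above can be sketched in Python roughly as follows. This is a minimal illustration and not the simulation from the article: the sample sizes, the effect-size distribution, the choice of Welch's t-test with a Benjamini-Hochberg correction, the significance threshold, and the forest settings are all assumptions, and the function names are hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier


def simulate_expression(n_per_group=10, n_genes=40, n_dysregulated=10, seed=0):
    """Simulate log-expression for two phenotypes (all sizes are assumed)."""
    rng = np.random.default_rng(seed)
    effects = np.zeros(n_genes)
    # Assumption: the first n_dysregulated genes get a random mean shift
    # in phenotype 1; the remaining genes have no true difference.
    effects[:n_dysregulated] = rng.normal(0.0, 2.0, size=n_dysregulated)
    x0 = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
    x1 = rng.normal(effects, 1.0, size=(n_per_group, n_genes))
    X = np.vstack([x0, x1])
    y = np.repeat([0, 1], n_per_group)
    return X, y, effects


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


def inference_calls(X, y, alpha=0.05):
    """Inference: per-gene Welch t-test, then multiple-testing adjustment."""
    _, p = stats.ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    return benjamini_hochberg(p) < alpha


def forest_ranking(X, y, seed=0):
    """ML: rank genes by random forest feature importance (best first)."""
    forest = RandomForestClassifier(n_estimators=500, random_state=seed)
    forest.fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1]


X, y, effects = simulate_expression()
significant = inference_calls(X, y)
ranking = forest_ranking(X, y)
print("genes called significant:", np.flatnonzero(significant))
print("top 10 genes by forest importance:", ranking[:10])
```

The two outputs answer different questions: the first is a per-gene yes/no call with an error rate controlled under the test's assumptions, while the second is only a ranking, so deciding how many top-ranked genes to treat as important is a separate, empirical choice.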
The boundary between statistics and ML is blurred, with many methods used in both. Statistics requires models based on biological knowledge, while ML relies on empirical performance. Inference and ML are complementary, each offering insights into biological processes. The choice between methods depends on the research question and data characteristics. Both approaches are essential for understanding complex biological systems.