VOL.15 NO.4 | APRIL 2018 | Danilo Bzdok, Naomi Altman & Martin Krzywinski
The chapter discusses the differences and similarities between statistical methods and machine learning (ML) in the context of biological systems. Statistics focuses on inference: it creates and fits probability models to understand the data-generating process and to test hypotheses. ML, on the other hand, emphasizes prediction, using general-purpose algorithms to find patterns in large, complex datasets. Although either approach can serve both inference and prediction, the two differ in their assumptions, computational tractability, and the types of data they handle best.
Classical statistical inference works well with relatively few input variables and moderate sample sizes, where an explicit model can describe the relationships among the variables. However, as the number of variables and the possible associations among them grow, the model needed to capture those relationships becomes more complex and statistical inferences become less precise, blurring the line between statistical and ML approaches.
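One way this loss of precision can show up is through multiple-testing correction: when each variable gets its own test, the per-test significance threshold must tighten as the number of variables grows, and power falls. The sketch below is only an illustration under stated assumptions. The Bonferroni correction, the effect size of 1 standard deviation, and the 20 samples per group are choices made here, not values from the chapter, and `power_two_sample_t` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats


def power_two_sample_t(effect, n_per_group, alpha):
    """Power of a two-sided, equal-variance two-sample t-test.

    effect is the true mean difference in units of the common standard
    deviation (an assumed quantity, chosen for illustration).
    """
    df = 2 * n_per_group - 2
    # Noncentrality parameter of the t statistic under the alternative.
    ncp = effect * np.sqrt(n_per_group / 2.0)
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)
    # Probability that |t| exceeds the critical value.
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)


# Assumed setup: 20 samples per group, true effect of 1 standard deviation,
# family-wise error rate 0.05 split evenly over m tests (Bonferroni).
for m in (1, 10, 100, 1000):
    per_test_alpha = 0.05 / m
    print(m, round(power_two_sample_t(1.0, 20, per_test_alpha), 3))
```

The printed power decreases as `m` grows, because the same data must clear an ever stricter per-test threshold.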
ML methods such as random forests cope well with high-dimensional data and can capture nonlinear interactions, making them useful for 'wide data', where the number of input variables exceeds the number of samples. However, the lack of an explicit model can make ML solutions harder to interpret in the context of biological knowledge.
The chapter uses a simulation of gene expression data to compare classical statistical inference with ML. The two approaches yield similar results in identifying dysregulated genes: on average, classical inference correctly identifies 7.4 of the 10 dysregulated genes and ML identifies 7.7. The simulation highlights the complementary nature of statistical inference and ML, each offering distinct strengths in different contexts.
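A comparison of this kind can be sketched as follows. This is a minimal illustration and not the chapter's actual simulation. Only the count of 10 dysregulated genes comes from the text above. Everything else is an assumption made for the example: the 40 genes, the 10 samples per phenotype, the effect size, the per-gene t-test with Benjamini-Hochberg correction standing in for classical inference, and the top-10 ranking by random-forest feature importance standing in for ML. All function names are hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier


def simulate_expression(rng, n_per_group=10, n_genes=40, n_dysregulated=10,
                        effect=2.0):
    """Return (X, y): one row per sample, one column per gene.

    All sizes and the effect are assumed values; the first n_dysregulated
    genes are shifted in phenotype 1.
    """
    y = np.repeat([0, 1], n_per_group)
    X = rng.normal(0.0, 1.0, size=(2 * n_per_group, n_genes))
    X[y == 1, :n_dysregulated] += effect
    return X, y


def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected at false discovery rate alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        # Step-up rule: reject every hypothesis up to the largest passing rank.
        k = np.nonzero(passed)[0].max()
        rejected[order[:k + 1]] = True
    return rejected


def hits_by_inference(X, y, alpha=0.05):
    """Genes flagged by a per-gene two-sample t-test with BH correction."""
    _, pvals = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.nonzero(benjamini_hochberg(pvals, alpha))[0]


def hits_by_random_forest(X, y, top_k=10, seed=0):
    """Top-k genes ranked by random-forest feature importance."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1][:top_k]


rng = np.random.default_rng(0)
X, y = simulate_expression(rng)
truth = set(range(10))  # indices of the truly dysregulated genes
found_stat = truth & set(hits_by_inference(X, y).tolist())
found_rf = truth & set(hits_by_random_forest(X, y).tolist())
print("inference:", len(found_stat), "of 10; random forest:", len(found_rf), "of 10")
```

Averaging these counts over many simulated datasets would give summary figures comparable in kind to the 7.4/10 and 7.7/10 reported above, although the numbers from this sketch depend entirely on the assumed settings.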