23 January 2024 | Jean-Gabriel Gaudreault, Paula Branco
This paper addresses the challenge of evaluating the performance of machine learning models in imbalanced classification scenarios, where one class is significantly underrepresented. The authors experimentally study the impact of different performance metrics on the evaluation of binary classifiers, considering factors such as class imbalance and data noise. They provide guidelines for selecting the most appropriate metric based on the context of the problem. Specifically, they highlight the importance of using multiple metrics that are fundamentally different in imbalanced domains. The study also recommends using Davis' interpolation of the area under the precision-recall curve and the Matthews Correlation Coefficient over other similar metrics, while suggesting that the geometric mean and $F_1$ score should be avoided in scenarios with noisy labels. The paper includes a concrete example demonstrating how different metrics can lead to different model selections, emphasizing the need for researchers and end-users to carefully choose metrics that accurately reflect the desired outcome.
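To make the metric-disagreement point concrete, here is a minimal sketch (not the paper's experimental setup) that scores two classifiers on a synthetic imbalanced dataset with a small amount of label noise, using the metrics discussed above. Note one assumption: scikit-learn's `average_precision_score` computes a step-wise estimate of the area under the precision-recall curve, which is a stand-in here for Davis' interpolation rather than an implementation of it.

```python
# Sketch: compare MCC, F1, G-mean, and AUC-PR for two models on
# imbalanced data (~5% positives) with mild label noise (flip_y).
# AUC-PR is scikit-learn's step-wise estimate, not Davis' interpolation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (matthews_corrcoef, f1_score,
                             confusion_matrix, average_precision_score)

X, y = make_classification(n_samples=5000, weights=[0.95],
                           flip_y=0.02, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def report(name, model):
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    scores = model.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
    # G-mean = sqrt(sensitivity * specificity)
    gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
    print(f"{name}: MCC={matthews_corrcoef(y_te, y_hat):.3f} "
          f"F1={f1_score(y_te, y_hat):.3f} G-mean={gmean:.3f} "
          f"AUC-PR={average_precision_score(y_te, scores):.3f}")

report("logistic", LogisticRegression(max_iter=1000))
report("tree", DecisionTreeClassifier(max_depth=5, random_state=0))
```

If the two models swap ranks depending on the metric, that is exactly the situation the authors warn about: reporting a single metric would silently commit you to one model selection, which is why they advise tracking several fundamentally different metrics in imbalanced domains.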