25 March 2024 | Rémi Thériault · Mattan S. Ben-Shachar · Indrajeet Patil · Daniel Lüdecke · Brenton M. Wiernik · Dominique Makowski
This paper introduces methods for identifying statistical outliers in R using the {performance} package from the easystats ecosystem. It covers univariate, multivariate, and model-based outlier detection methods, their recommended thresholds, outputs, and plotting techniques. The paper also discusses different types of outliers, whether to exclude or winsorize them, and the importance of transparency in outlier treatment.
Outliers are observations that deviate markedly from the bulk of the data. They may come from a different distribution than the rest of the sample, or they may be extreme cases of the same distribution. Handled improperly, outliers can bias statistical estimates and degrade model performance, so they need to be addressed thoughtfully.
Many researchers apply outlier detection methods inconsistently or rely on inappropriate strategies. The paper aims to help researchers choose and apply suitable methods. It explains that measures based on the mean (e.g., z-scores) are common but are not robust, because the mean and standard deviation are themselves distorted by the outliers they are meant to detect. Robust methods based on the median and the median absolute deviation (MAD) are recommended instead.
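The contrast between mean-based and median-based scores can be illustrated with a minimal base R sketch. The data values and the cut-off of 3 are made up for illustration and are not taken from the paper; the code only uses functions from the {stats} package.

```r
# Toy data (hypothetical): eight typical values plus one extreme value
x <- c(10, 11, 9, 10, 12, 11, 10, 9, 100)

# Classical z-scores: centre on the mean, scale by the standard deviation.
# Both statistics are pulled towards the extreme value.
z_classic <- (x - mean(x)) / sd(x)

# Robust z-scores: centre on the median, scale by the MAD.
# stats::mad() applies a consistency constant (1.4826) by default,
# which makes it comparable to the SD for normally distributed data.
z_robust <- (x - median(x)) / mad(x)

# Illustrative cut-off (an assumption, not the package default)
cutoff <- 3
flag_classic <- abs(z_classic) > cutoff
flag_robust <- abs(z_robust) > cutoff

which(flag_classic)  # no observation is flagged
which(flag_robust)   # the ninth observation (100) is flagged
```

In this toy sample the extreme value inflates the standard deviation so much that its own classical z-score stays below the cut-off, whereas the median and MAD are barely affected by it.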
The paper discusses various outlier detection methods, including univariate, multivariate, and model-based approaches. It emphasizes that the choice of method depends on factors such as the statistical test of interest and the nature of the data. Model-based outlier detection involves identifying observations where the regression model does not fit well, while distribution-based methods rely on the distance from the center of the population.
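As a rough sketch of the two families, the base R code below computes a model-based influence measure (Cook's distance) and a distribution-based distance (the classical Mahalanobis distance) on the built-in mtcars data. The choice of variables, the 4/n rule of thumb, and the chi-square cut-off are illustrative assumptions, not the thresholds recommended in the paper. The classical Mahalanobis distance shown here relies on the mean and covariance, so it is not one of the robust variants.

```r
# Model-based: how strongly does each observation influence the fitted regression?
model <- lm(mpg ~ wt + hp, data = mtcars)
cooks_d <- cooks.distance(model)
cook_cutoff <- 4 / nrow(mtcars)  # common rule of thumb, used here as an assumption
flag_model <- cooks_d > cook_cutoff

# Distribution-based: how far is each observation from the multivariate centre?
vars <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
d2 <- mahalanobis(vars, center = colMeans(vars), cov = cov(vars))
dist_cutoff <- qchisq(0.999, df = ncol(vars))  # illustrative cut-off
flag_dist <- d2 > dist_cutoff

# Row names of the observations flagged by each approach
names(which(flag_model))
names(which(flag_dist))
```

The two approaches answer different questions, so they need not flag the same observations: the first depends on the regression being fitted, the second only on the joint distribution of the variables.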
The paper also highlights the importance of transparency in outlier treatment. Researchers should commit to an outlier treatment method before data collection and document their decisions and methods. This helps reduce false positives due to excessive researcher flexibility.
The paper provides examples of how to implement these methods in R using the {performance} package. It includes code for identifying univariate outliers using the check_outliers() function with the "zscore_robust" method. The threshold for identifying outliers is set to approximately 3.29 MAD by default, which is less conservative than some recommended thresholds. Users can adjust this threshold using the threshold argument.
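A minimal sketch of this workflow follows, assuming the {performance} package is installed. The toy data (the first four columns of mtcars with two artificial extreme rows appended) and the stricter cut-off of 3 are illustrative choices. The threshold is passed in the named-list form described in the package documentation.

```r
library(performance)

# Toy data: first four columns of mtcars plus two artificial extreme rows
data <- rbind(mtcars[1:4], 42, 55)

# Robust (MAD-based) z-scores with the default threshold of about 3.29
outliers <- check_outliers(data, method = "zscore_robust")
which(outliers)  # row indices of the flagged observations

# The value 3.29 is the two-tailed normal quantile for p = .001
qnorm(1 - 0.001 / 2)

# A stricter (lower) threshold, supplied as a named list
outliers_strict <- check_outliers(
  data,
  method = "zscore_robust",
  threshold = list(zscore_robust = 3)
)
which(outliers_strict)
```

Lowering the threshold can only flag the same observations or more of them, so the cut-off is best fixed in advance rather than tuned after looking at the results.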