Species distribution models (SDMs) are widely used to address questions in conservation biology, ecology, and evolution. To predict species distributions accurately, these models require both species presence data and background or pseudo-absence data. However, there is no consensus on how to select pseudo-absences, where to sample them, or how many to use. This study provides guidelines for selecting pseudo-absences to build reliable SDMs.
The study used simulated data to evaluate how the pseudo-absence selection method, the number of pseudo-absences, and the weighting scheme affect model accuracy across seven common SDM techniques spanning regression, classification, and machine-learning approaches. For regression techniques (e.g., generalized linear models and generalized additive models), randomly selected pseudo-absences with equal weighting for presences and absences produced the most accurate models. For classification and machine-learning techniques (e.g., boosted regression trees, classification trees, and random forests), the number of pseudo-absences had the greatest impact on model accuracy, and averaging several runs with fewer pseudo-absences yielded the most predictive models.
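The idea of averaging several runs, each fitted with a fresh and relatively small draw of pseudo-absences, can be sketched as follows. This is a minimal illustration and not the study's code: `average_over_runs`, `toy_fit_predict`, the single temperature variable, and the three target sites are all hypothetical, and the toy scoring rule merely stands in for a real SDM technique.

```python
import random
from statistics import fmean


def average_over_runs(fit_predict, presences, background, n_pseudo, n_runs=10, seed=0):
    """Average predictions over several runs, redrawing pseudo-absences each run.

    fit_predict(presences, pseudo_absences) must return one prediction per
    target site, always in the same site order.
    """
    rng = random.Random(seed)
    per_run = []
    for _ in range(n_runs):
        # Fresh random draw (without replacement) from the background.
        pseudo_absences = rng.sample(background, n_pseudo)
        per_run.append(fit_predict(presences, pseudo_absences))
    # zip(*per_run) groups the predictions site by site across runs.
    return [fmean(site_predictions) for site_predictions in zip(*per_run)]


def toy_fit_predict(presences, pseudo_absences):
    """Hypothetical stand-in for an SDM, using one variable (temperature).

    A target site scores 1.0 if its temperature is closer to the mean of the
    presences than to the mean of the pseudo-absences, otherwise 0.0.
    """
    targets = [5.0, 15.0, 25.0]  # assumed temperatures at three target sites
    mean_presence = fmean(presences)
    mean_absence = fmean(pseudo_absences)
    return [
        1.0 if abs(t - mean_presence) < abs(t - mean_absence) else 0.0
        for t in targets
    ]


presences = [14.0, 15.0, 16.0]
background = [float(t) for t in range(31)]
averaged = average_over_runs(toy_fit_predict, presences, background, n_pseudo=5)
print(averaged)
```

Each averaged value is the share of runs in which the toy model scored that site as suitable, so it always lies between 0 and 1. With a real technique, `fit_predict` would fit the model on the presences plus the drawn pseudo-absences and return predicted suitability for the target sites.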
The study also found that the optimal number of pseudo-absences depended on the SDM technique and the sampling design. For regression techniques, a large number of pseudo-absences (e.g., 10,000) with equal weighting was recommended. For classification and machine-learning techniques, the number of pseudo-absences should match the number of presences, with equal weighting. Additionally, pseudo-absences should be randomly selected when using regression techniques and randomly selected with geographical and environmental exclusion when using classification and machine-learning techniques.
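The selection and weighting recommendations above can be illustrated with a short sketch. It rests on several assumptions: sites are planar `(x, y)` tuples, geographical exclusion is reduced to a simple minimum-distance rule (`min_distance`, a hypothetical parameter), environmental exclusion is not implemented, and "equal weighting" is read as giving presences and pseudo-absences the same total weight. The function names are illustrative, not taken from the study.

```python
import math
import random


def sample_pseudo_absences(background, presences, n, min_distance=0.0, seed=0):
    """Randomly draw n pseudo-absences from the background sites.

    Sites that are presences are never drawn. With min_distance > 0, sites
    closer than min_distance to any presence are excluded as well (an assumed,
    simplified form of geographical exclusion).
    """
    rng = random.Random(seed)
    presence_set = set(presences)
    candidates = [
        site
        for site in background
        if site not in presence_set
        and all(math.dist(site, p) >= min_distance for p in presences)
    ]
    if n > len(candidates):
        raise ValueError("not enough candidate sites for the requested sample")
    return rng.sample(candidates, n)


def equal_total_weights(n_presences, n_pseudo_absences):
    """Weights giving presences and pseudo-absences the same total weight."""
    absence_weight = n_presences / n_pseudo_absences
    return [1.0] * n_presences + [absence_weight] * n_pseudo_absences


# Hypothetical 10 x 10 grid of sites with two presences.
background = [(x, y) for x in range(10) for y in range(10)]
presences = [(0, 0), (1, 1)]

# Regression-style setup: many random pseudo-absences, down-weighted.
many = sample_pseudo_absences(background, presences, n=50)
weights = equal_total_weights(len(presences), len(many))

# Classification-style setup: as many pseudo-absences as presences,
# drawn away from the presences.
few = sample_pseudo_absences(background, presences, n=len(presences), min_distance=3.0)
```

When the two groups have the same size, `equal_total_weights` returns a weight of 1.0 for every point, so the weighting only matters in the large-sample case.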
The study also highlighted the importance of considering sampling bias in presence data. Climatically or spatially biased presence data can affect model accuracy, and the optimal use of pseudo-absences may vary depending on the type of bias. Overall, the study provides a comprehensive framework for selecting pseudo-absences to build reliable SDMs, emphasizing the need for a large number of pseudo-absences with equal weighting for presences and absences when using regression techniques, and matching the number of pseudo-absences to the number of presences for classification and machine-learning techniques.