The effect of dataset construction and data pre-processing on the eXtreme Gradient Boosting algorithm applied to head rice yield prediction in Australia

2024 | A. Clarke, D. Yates, C. Blanchard, M.Z. Islam, R. Ford, S. Rehman, R. Walsh
This study investigates the impact of dataset construction and data pre-processing on the accuracy of a Head Rice Yield (HRY) prediction model built with the eXtreme Gradient Boosting (XGBoost) algorithm. The research focuses on an industry-level dataset provided by SunRice, Australia's leading rice trading company. Two dataset construction methods were employed: one aggregating meteorological conditions over estimated phenological stages, and the other aggregating them over defined time periods. Variations of these methods were explored to assess their impact on model accuracy. Feature selection was applied to each constructed dataset before an XGBoost model was trained and evaluated on it using Leave-One-Out Cross-Validation.
The time-based dataset construction method yielded the highest mean model accuracy, with the two-week aggregation dataset showing a 125% increase in Lin’s Concordance Correlation Coefficient compared to the worst-performing model. The study highlights the importance of accurate crop stage knowledge for improving future rice crop management and the potential for SunRice to predict HRY at the reception point to optimize post-harvest handling and milling. The findings also suggest that the dataset construction methods can be replicated in other rice-growing regions when matched with region-specific data.
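The evaluation described above can be illustrated with a minimal sketch, under stated assumptions. Lin's Concordance Correlation Coefficient is computed here with its standard definition (population variances); the summary does not say how the authors pooled or averaged scores across folds, so pooling the held-out predictions is an assumption. A least-squares linear model stands in for XGBoost so the example has no dependency beyond NumPy, and the helper names (`lins_ccc`, `loocv_predictions`, `fit_linear`, `predict_linear`, `percent_increase`) are hypothetical.

```python
import numpy as np


def lins_ccc(y_true, y_pred):
    """Lin's Concordance Correlation Coefficient (population variances)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    # The denominator penalises both scale differences and mean shift.
    return 2.0 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2)


def loocv_predictions(X, y, fit, predict):
    """Hold out each sample once, train on the rest, predict the held-out one."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.ones(n, dtype=bool)
        train[i] = False
        model = fit(X[train], y[train])
        preds[i] = predict(model, X[i:i + 1])[0]
    return preds


# Stand-in model (NOT XGBoost): ordinary least squares with an intercept.
def fit_linear(X, y):
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def percent_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    # Synthetic data for illustration only; not from the SunRice dataset.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 3))
    y = X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=30)
    preds = loocv_predictions(X, y, fit_linear, predict_linear)
    print("pooled LOOCV CCC:", lins_ccc(y, preds))
    # Hypothetical CCC values showing what a 125% relative increase means.
    print("relative increase (%):", percent_increase(0.45, 0.20))
```

With the `xgboost` package installed, the stand-in could be swapped for functions that wrap `xgboost.XGBRegressor` and its `fit` and `predict` methods. A 125% increase means the better score is 2.25 times the worse one; the values 0.45 and 0.20 are made up to show the arithmetic and are not results from the study.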