24 Oct 2018 | Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin
CatBoost is a new open-source gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations in terms of quality on a set of popular datasets. It provides both CPU and GPU implementations, and the GPU training implementation is significantly faster than those of other gradient boosting libraries on ensembles of similar size.

Rather than preprocessing categorical features into numbers before training, CatBoost computes target statistics on the fly: the dataset is processed in a random permutation, and each example's category is encoded using only the target values of the examples that precede it in that permutation, smoothed with a prior value. Restricting the statistic to preceding examples avoids target leakage, and the prior keeps estimates for rare categories from overfitting; a sketch of this statistic is given below. CatBoost also builds combinations of categorical features greedily during tree construction, which can capture interactions that no single feature expresses and thereby improve model quality.

As base predictors, CatBoost uses oblivious trees, in which the same split criterion is applied across an entire level of the tree. Such trees are balanced and less prone to overfitting, and since every split reduces to a binary test, the library's fast scorer can compute a leaf index simply by packing binary feature values into an integer, making model application very efficient (illustrated below). In addition, CatBoost uses a new schema for calculating leaf values when selecting the tree structure, which further reduces overfitting.

The GPU implementation of the training algorithm is optimized for performance through efficient memory management and parallel processing. Experiments show that CatBoost outperforms other gradient boosting libraries such as XGBoost, LightGBM, and H2O in both training and scoring performance. The library is open-source and available for use; a minimal usage example closes this summary.
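To make the ordered target statistic concrete, here is a minimal Python sketch of the idea described above: each example's category is encoded from the targets of the examples preceding it in a random permutation, smoothed with a prior. The function and parameter names (`ordered_target_statistic`, `prior_weight`) are illustrative, not CatBoost internals, and the library's exact prior and averaging details may differ.

```python
import numpy as np

def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each example's category using only the examples that
    precede it in a random permutation, smoothed with a prior.
    Illustrative sketch, not the library's internal implementation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(categories))
    sums, counts = {}, {}                 # running target sum / count per category
    encoded = np.empty(len(categories), dtype=float)
    for idx in perm:                      # walk the dataset in permutation order
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # statistic uses only the preceding examples plus the prior
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        sums[cat] = s + targets[idx]      # only now does this example contribute
        counts[cat] = c + 1
    return encoded

# toy usage
cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 1, 0, 1]
print(ordered_target_statistic(cats, ys, prior=0.5))
```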
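Feature combinations can be pictured as tupling the values of several categorical columns into a new categorical feature, which is then encoded with the same ordered statistic. This sketch shows only the pairing step; the library itself constructs such combinations greedily during tree building, and the names here are illustrative.

```python
def combine_features(col_a, col_b):
    """Form a new categorical feature as the pair of values.
    Illustrative sketch of the combination step only."""
    return [(a, b) for a, b in zip(col_a, col_b)]

region = ["eu", "us", "eu", "us"]
device = ["ios", "ios", "android", "android"]
combo = combine_features(region, device)   # e.g. ("eu", "ios")
# the combined feature would then be encoded with the ordered statistic above
```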
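Applying an oblivious tree reduces to bit-packing, as in the following sketch: every level of the tree tests one binary feature, so the leaf index is just those bits assembled into an integer. The array layout and names are assumptions for illustration, not the library's internal scorer.

```python
import numpy as np

def apply_oblivious_tree(binary_features, split_ids, leaf_values):
    """Evaluate one oblivious tree of depth len(split_ids).
    binary_features: (n_examples, n_binary) 0/1 matrix.
    Illustrative sketch, not CatBoost's internal API."""
    n = binary_features.shape[0]
    leaf_index = np.zeros(n, dtype=np.int64)
    for depth, f in enumerate(split_ids):            # one split per tree level
        leaf_index |= binary_features[:, f].astype(np.int64) << depth
    return leaf_values[leaf_index]

# a depth-2 tree over 3 binarized features, with 4 leaf values
X_bin = np.array([[1, 0, 1],
                  [0, 1, 1]])
print(apply_oblivious_tree(X_bin, split_ids=[0, 2],
                           leaf_values=np.array([0.1, 0.2, 0.3, 0.4])))
```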
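The leaf-value schema can be pictured in the same ordered spirit: a hedged sketch in which the value an example sees during tree-structure selection is computed only from the gradients of the examples that precede it, in the permutation, within the same leaf. This is a simplified illustration of the idea; the library's exact procedure may differ.

```python
def ordered_leaf_values(leaf_ids, gradients, perm):
    """For each example, average only the gradients of preceding
    (in the permutation) examples in the same leaf. Hedged sketch,
    not the library's exact leaf-value computation."""
    sums, counts = {}, {}
    values = [0.0] * len(leaf_ids)
    for idx in perm:
        leaf = leaf_ids[idx]
        s, c = sums.get(leaf, 0.0), counts.get(leaf, 0)
        values[idx] = s / c if c else 0.0   # only preceding examples contribute
        sums[leaf] = s + gradients[idx]
        counts[leaf] = c + 1
    return values

perm = [2, 0, 3, 1]
print(ordered_leaf_values(leaf_ids=[0, 0, 1, 1],
                          gradients=[0.5, -0.2, 0.1, 0.3], perm=perm))
```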
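Finally, a minimal end-to-end example with the library's public Python API: `CatBoostClassifier`, `fit`, and the `cat_features` argument are the real interface, while the toy dataset and parameter values are assumptions chosen for brevity (requires the `catboost` package).

```python
from catboost import CatBoostClassifier

# Categorical columns are passed as raw values; CatBoost encodes them
# internally with its ordered statistics, so no manual one-hot encoding.
X = [["eu", 25], ["us", 31], ["eu", 40], ["us", 22]]
y = [1, 0, 1, 0]

model = CatBoostClassifier(iterations=50, depth=4, verbose=False)
model.fit(X, y, cat_features=[0])     # column 0 is categorical
print(model.predict([["eu", 30]]))
```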