CatBoost: gradient boosting with categorical features support

24 Oct 2018 | Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin
This paper introduces CatBoost, an open-source gradient boosting library that handles categorical features effectively and outperforms existing implementations such as XGBoost, LightGBM, and H2O in quality on a variety of datasets. CatBoost provides both CPU and GPU implementations and offers significantly faster training and scoring than other gradient boosting libraries: GPU acceleration speeds up learning, while the CPU scorer evaluates ensembles of similar size more efficiently than competing libraries.

CatBoost addresses the challenge of categorical features by integrating their processing into training rather than pre-processing them. It employs a novel method for calculating leaf values during tree structure selection, which helps reduce overfitting. In addition, CatBoost builds oblivious trees, which are balanced and less prone to overfitting, and this symmetric structure also underpins the fast scorer and efficient GPU training.

The paper includes a detailed comparison of CatBoost with other gradient boosting libraries on several datasets, demonstrating superior performance on classification tasks. It also compares GPU and CPU training performance, showing that the GPU implementation achieves significant speedups, including on older-generation GPUs. The experimental results highlight CatBoost's effectiveness and efficiency on both dense numerical and categorical features.
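The core idea behind integrating categorical features into training, as the summary describes it, is to encode each category from target statistics computed only over examples seen earlier in some ordering, which avoids target leakage. The sketch below is a simplified illustration of that idea, not the paper's actual implementation (the library additionally averages over several random permutations and uses configurable priors):

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5):
    """Encode a categorical column with ordered target statistics.

    Each example is encoded using only the targets of examples that
    appear *before* it in the given order -- a simplified sketch of
    how CatBoost avoids target leakage when turning categories into
    numbers during training. `prior` smooths rare categories.
    """
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running count per category
    encoded = []
    for cat, y in zip(categories, targets):
        # smoothed mean of previously seen targets for this category
        encoded.append((sums[cat] + prior) / (counts[cat] + 1))
        sums[cat] += y
        counts[cat] += 1
    return encoded

# The first "a" has no history, so it gets the smoothed prior;
# later occurrences blend in the targets already observed.
print(ordered_target_stats(["a", "b", "a", "a"], [1, 0, 1, 0]))
```

Note how the same category receives different encodings at different positions; a plain (non-ordered) target encoding would instead use the full-dataset mean for every row, leaking the row's own target into its feature.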
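The fast scoring attributed to oblivious trees follows from their symmetry: every node at a given depth applies the same split, so a leaf index is just a bitmask of per-level comparisons. A minimal sketch of that lookup (hypothetical helper, not the library's scorer API):

```python
def oblivious_tree_predict(features, splits, leaf_values):
    """Score one example with an oblivious (symmetric) tree.

    `splits` holds one (feature_id, threshold) pair per tree level,
    since an oblivious tree reuses the same split across each level.
    The leaf index is assembled as a bitmask of split outcomes, which
    is branch-free and cheap -- the property that makes this tree
    shape attractive for fast scoring.
    """
    idx = 0
    for depth, (feature_id, threshold) in enumerate(splits):
        bit = 1 if features[feature_id] > threshold else 0
        idx |= bit << depth
    return leaf_values[idx]

# depth-2 tree: level 0 splits on feature 0, level 1 on feature 1
splits = [(0, 0.5), (1, 0.5)]
leaf_values = [10.0, 20.0, 30.0, 40.0]
print(oblivious_tree_predict([1.0, 0.0], splits, leaf_values))
```

Because the index computation is identical for every example, batches of rows can be scored with the same handful of comparisons and shifts, which also maps well onto SIMD and GPU execution.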