Published online 8 March 2024 | Xue Jia, Alex Aziz, Yusuke Hashimoto and Hao Li
This study addresses the challenges of big data in AI for thermoelectric (TE) materials, focusing on the development of robust machine learning (ML) models. The authors begin by identifying and discarding questionable data from the Starrydata2 database, which yields a cleaned dataset of 92,291 data points. They then propose a composition-based cross-validation method to avoid overfitting: data points with the same composition but different temperatures are always kept in the same set rather than split between training and test sets. Using gradient boosting decision tree (GBDT) models, they achieve R² values of ~0.89, ~0.90, and ~0.89 on the training, test, and out-of-sample datasets, respectively.
This model is then used to predict the stability of materials from the Materials Project database, identifying Ge₂Te₃As₂ and Ge₄(Te,As)₅ as promising candidates with high zT values. Density functional theory (DFT) calculations confirm these predictions, validating the model's accuracy. The study highlights the importance of data preprocessing, cross-validation, and the integration of ML techniques in advancing the field of TE materials.
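The composition-based cross-validation described in this summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `composition_kfold`, the round-robin fold assignment, and the sample records are assumptions made only for illustration. The idea shown is that whole compositions, not individual measurements, are assigned to folds, so measurements of one material at different temperatures never appear on both sides of a split.

```python
import random
from collections import defaultdict


def composition_kfold(compositions, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs in which no composition spans both sets.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Group row indices by their composition string.
    groups = defaultdict(list)
    for idx, comp in enumerate(compositions):
        groups[comp].append(idx)

    unique = sorted(groups)
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of unique compositions")

    # Shuffle compositions reproducibly, then deal them into folds round-robin.
    random.Random(seed).shuffle(unique)
    folds = [unique[i::n_splits] for i in range(n_splits)]

    for held_out in folds:
        held = set(held_out)
        # Every measurement of a held-out composition goes to the test set.
        test_idx = sorted(i for comp in held_out for i in groups[comp])
        train_idx = sorted(
            i for comp in unique if comp not in held for i in groups[comp]
        )
        yield train_idx, test_idx


# Made-up records (composition, temperature in K), for illustration only.
records = [
    ("Bi2Te3", 300), ("Bi2Te3", 400), ("Bi2Te3", 500),
    ("PbTe", 300), ("PbTe", 600),
    ("SnSe", 300), ("SnSe", 700), ("SnSe", 900),
    ("GeTe", 500), ("GeTe", 700),
]
compositions = [comp for comp, _ in records]
splits = list(composition_kfold(compositions, n_splits=2))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A plain random split over rows would instead let, say, Bi2Te3 at 300 K land in training and Bi2Te3 at 400 K in the test set, which inflates test scores. For real workflows, scikit-learn's `GroupKFold` offers the same grouping behavior when the composition is passed as the `groups` argument.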