Dealing with the big data challenges in AI for thermoelectric materials

April 2024 | Xue Jia, Alex Aziz, Yusuke Hashimoto, Hao Li
The development of artificial intelligence (AI), particularly data science and machine learning (ML), is transforming materials science. However, challenges remain, including errors in large-scale materials datasets and overfitting when modeling temperature-dependent properties. This study addresses these issues using thermoelectric (TE) materials as an example. After questionable entries in the Starrydata2 database were identified and discarded, 92,291 data points remained. A composition-based cross-validation method was proposed to prevent overfitting: data points that share a composition but differ in temperature are always kept together and never split across different sets. ML models based on gradient boosting decision trees (GBDT) achieved high $R^2$ values, indicating that they predict TE properties accurately. Using these models, stable materials from the Materials Project database were screened, and high zT values were predicted for $\mathrm{Ge_2Te_5As_2}$ and $\mathrm{Ge_3(Te_3As)_2}$. Density functional theory (DFT) calculations confirmed these predictions, giving maximum zT values of 1.98 and 2.12 for n- and p-type $\mathrm{Ge_2Te_5As_2}$, and 0.58 and 0.74 for n- and p-type $\mathrm{Ge_3(Te_3As)_2}$. The study highlights the importance of handling big data challenges in AI for materials science, emphasizing that data preprocessing and careful validation are needed to ensure model reliability. The results demonstrate the potential of ML to accelerate the discovery and optimization of TE materials.
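The composition-based cross-validation idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `composition_kfold`, the toy records, and the round-robin fold assignment are all assumptions made for illustration. It only shows the core constraint described in the summary, namely that all measurements of one composition, whatever their temperature, land in the same fold.

```python
import random
from collections import defaultdict


def composition_kfold(compositions, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs where each composition stays in one fold.

    Hypothetical helper for illustration; `compositions` holds one
    composition label per data point (e.g. a normalized formula string).
    """
    # Collect the row indices that belong to each distinct composition.
    groups = defaultdict(list)
    for idx, comp in enumerate(compositions):
        groups[comp].append(idx)

    unique = sorted(groups)
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of compositions")

    # Shuffle whole compositions (not rows), then deal them round-robin.
    random.Random(seed).shuffle(unique)
    folds = [unique[i::n_splits] for i in range(n_splits)]

    for fold in folds:
        test_idx = sorted(i for comp in fold for i in groups[comp])
        held_out = set(test_idx)
        train_idx = [i for i in range(len(compositions)) if i not in held_out]
        yield train_idx, test_idx


# Toy records (composition, temperature in K); placeholder values, not real data.
records = [
    ("Bi2Te3", 300), ("Bi2Te3", 400),
    ("PbTe", 300), ("PbTe", 500),
    ("SnSe", 300), ("SnSe", 700),
    ("GeTe", 600),
]
labels = [comp for comp, _ in records]

for train_idx, test_idx in composition_kfold(labels, n_splits=3, seed=1):
    train_comps = {labels[i] for i in train_idx}
    test_comps = {labels[i] for i in test_idx}
    # No composition appears on both sides of the split.
    assert not (train_comps & test_comps)
    print(sorted(test_comps))
```

With a plain random split, a model could see a composition at 300 K during training and be scored on the same composition at 400 K, which inflates the apparent accuracy. Grouping by composition removes that leakage. scikit-learn's `GroupKFold` offers a comparable group-aware split for real pipelines.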