27 Jan 2024 | Noah D. Brenowitz, Yair Cohen, Jaideep Pathak, Ankur Mahesh, Boris Bonev, Thorsten Kurth, Dale R. Durran, Peter Harrington, Michael S. Pritchard
This paper addresses the challenge of evaluating the probabilistic skill of AI weather models, which have primarily been assessed using deterministic skill scores. The authors introduce a practical and parameter-free benchmark, the lagged ensemble (LEF), to compare the probabilistic skill of leading AI weather models against an operational baseline. LEF constructs an ensemble from a library of deterministic forecasts, allowing for a fair comparison without the need for ensemble initialization techniques or noise injection methods. The results show that two prominent AI models, GraphCast and Pangu, are tied on the probabilistic Continuous Ranked Probability Score (CRPS) metric, despite GraphCast outperforming Pangu in deterministic scoring.
The study also reveals that multi-step loss functions, commonly used in data-driven models, can improve deterministic metrics while deteriorating probabilistic skill. This is confirmed through ablations on a Spherical Fourier Neural Operator (SFNO) model, which demonstrate that reducing effective resolution can improve ensemble dispersion and hence calibration. The authors hope that these insights will guide the development of more accurate and reliable AI weather forecasts.
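The core idea of a lagged ensemble is simple enough to sketch: forecasts initialized at successive earlier times, all valid at the same target time, are pooled into an ensemble and scored with CRPS against the verifying observation. The sketch below is not the authors' implementation; the `forecasts` storage layout (a dict mapping initialization time to a dict of lead-time fields) and the function names are hypothetical, and the CRPS uses the standard ensemble estimator, mean |x_i − y| − ½ · mean |x_i − x_j|.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of an ensemble (axis 0 = members) against an observation,
    via the estimator mean|x_i - y| - 0.5 * mean|x_i - x_j|."""
    members = np.asarray(members, dtype=float)
    skill = np.abs(members - obs).mean(axis=0)
    spread = np.abs(members[:, None] - members[None, :]).mean(axis=(0, 1))
    return skill - 0.5 * spread

def lagged_ensemble(forecasts, valid_time, lags):
    """Pool deterministic runs into a lagged ensemble: for each allowed
    lag, take the forecast initialized at valid_time - lag with lead
    time `lag`, so every member is valid at `valid_time`.
    `forecasts[t0][lead]` holds the field from the run started at t0."""
    members = [
        forecasts[valid_time - lag][lag]
        for lag in lags
        if valid_time - lag in forecasts and lag in forecasts[valid_time - lag]
    ]
    return np.stack(members)

# Tiny synthetic example: runs initialized at t=0 and t=6 (hours),
# each storing fields at a few lead times.
forecasts = {
    0: {6: np.array([1.0]), 12: np.array([0.0])},
    6: {6: np.array([2.0])},
}
ens = lagged_ensemble(forecasts, valid_time=12, lags=[6, 12])  # 2 members
score = crps_ensemble(ens, np.array([1.0]))
```

Because every member comes from a deterministic run at a different lag, no perturbed initial conditions or injected noise are needed, which is what makes the benchmark parameter-free.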