17 Jan 2024 | David Thulke, Yingbo Gao, Petrus Pelser, Rein Brune, Richa Jalota, Floris Fok, Michael Ramos, Ian van Wyk, Abdallah Nasir, Hayden Goldstein, Taylor Tragemann, Katie Nguyen, Ariana Fowler, Andrew Stanco, Jon Gabriel, Jordan Taylor, Dean Moro, Evgenii Tsymbalov, Juliette de Waal, Evgeny Matusov, Mudar Yaghi, Mohammad Shihadah, Hermann Ney, Christian Dugast, Jonathan Dotan, Daniel Erasmus
This paper introduces ClimateGPT, a family of domain-specific large language models designed to synthesize interdisciplinary research on climate change. The authors trained two 7B models from scratch on a 300B-token science-oriented dataset, with one model including 4.2B domain-specific tokens during pre-training and the other adapted to the climate domain after pre-training. Additionally, ClimateGPT-7B, 13B, and 70B models are continuously pre-trained from Llama 2 on a 4.2B-token domain-specific dataset. Each model is instruction fine-tuned on a high-quality, human-generated domain-specific dataset created in collaboration with climate scientists. To reduce hallucinations, the models are optimized for retrieval augmentation and a hierarchical retrieval strategy is proposed. To increase accessibility, cascaded machine translation is used to support multiple languages. The models can produce in-depth answers focusing on different perspectives in climate change research. The authors propose a suite of automatic climate-specific benchmarks to evaluate LLMs, showing that ClimateGPT-7B performs on par with the ten times larger Llama-2-70B Chat model while maintaining high performance on general domain benchmarks. Human evaluations confirm the trends observed in automatic evaluations. All models were trained and evaluated using renewable energy and are publicly available.
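To make the two inference-time ideas in the abstract concrete, the sketch below shows how a retrieval-augmented, English-centric climate model could be wrapped in cascaded machine translation: the user query is translated into English, passages are retrieved to ground the answer, the model generates in English, and the answer is translated back into the user's language. This is a minimal illustration of the general pattern, not the paper's actual implementation; the function names (`translate`, `retrieve_passages`, `generate_answer`) and their signatures are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str
    text: str


def retrieve_passages(query_en: str, k: int = 5) -> list[Passage]:
    """Hypothetical stand-in for the retrieval step, e.g. a hierarchical
    search that first selects relevant documents, then relevant passages."""
    raise NotImplementedError


def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical stand-in for a machine translation system."""
    raise NotImplementedError


def generate_answer(query_en: str, passages: list[Passage]) -> str:
    """Hypothetical stand-in for the instruction-tuned climate LLM,
    prompted with the retrieved passages to reduce hallucinations."""
    raise NotImplementedError


def answer_query(query: str, user_lang: str) -> str:
    """Cascaded pipeline: translate to English, run retrieval-augmented
    generation, translate the answer back to the user's language."""
    query_en = translate(query, src=user_lang, tgt="en") if user_lang != "en" else query
    passages = retrieve_passages(query_en)
    answer_en = generate_answer(query_en, passages)
    return translate(answer_en, src="en", tgt=user_lang) if user_lang != "en" else answer_en
```

A cascade like this keeps the language model itself monolingual, so multilingual support comes entirely from the surrounding translation components rather than from multilingual pre-training.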