17 Jan 2024 | David Thulke, Yingbo Gao, Petrus Pelser, Rein Brune, Richa Jalota, Floris Fok, Michael Ramos, Ian van Wyk, Abdallah Nasir, Hayden Goldstein, Taylor Tragemann, Katie Nguyen, Ariana Fowler, Andrew Stanco, Jon Gabriel, Jordan Taylor, Dean Moro, Evgenii Tsybalov, Juliette de Waal, Evgeny Matusov, Mudar Yaghi, Mohammad Shihadah, Hermann Ney, Christian Dugast, Jonathan Dotan, Daniel Erasmus
ClimateGPT is a family of domain-specific large language models designed to synthesize interdisciplinary research on climate change. Two 7B models were trained from scratch on a science-oriented dataset of 300B tokens, and additional models were continuously pre-trained on a domain-specific dataset of 4.2B tokens. Each model was instruction fine-tuned on high-quality, human-generated data created in collaboration with climate scientists. To reduce hallucinations, the models were optimized for retrieval augmentation, and a hierarchical retrieval strategy was proposed. To increase accessibility for non-English speakers, cascaded machine translation was used, which performs comparably to multilingual models. The models can produce in-depth answers from different perspectives, in addition to an overall answer. A suite of automatic climate-specific benchmarks was proposed to evaluate LLMs; on it, ClimateGPT-7B performed on par with the ten-times-larger Llama-2-70B Chat model while not degrading results on general-domain benchmarks. Human evaluation confirmed the trends seen in the benchmarks. All models were trained and evaluated using renewable energy and are publicly released.

The paper introduces ClimateGPT, an LLM that addresses climate questions by drawing on the collective knowledge, understanding, and decision-making capacity of the human population to harness climate social intelligence. For each request, the model can provide four types of answers: a natural-science answer, an answer about the economic aspects of climate change, an answer about social impacts, and a general high-level overview. The paper outlines the technical approach: domain-specific pre-training, instruction fine-tuning, retrieval-augmented generation, multilinguality, automatic evaluation, human evaluation, and responsible AI. The models use a decoder-only Transformer architecture, with improvements such as an increased context length and grouped-query attention.
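The summary mentions grouped-query attention but gives no configuration details. As a minimal NumPy sketch of the idea, each group of query heads shares a single key/value head, shrinking the KV cache; the head counts and dimensions below are illustrative assumptions, not ClimateGPT's actual hyperparameters:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention sketch: the n_q query heads are split
    into groups, and each group attends with one shared KV head.
    Shapes: q (n_q, seq, d), k and v (n_kv, seq, d), n_q % n_kv == 0."""
    n_q, seq_len, d = q.shape
    n_kv = k.shape[0]
    group = n_q // n_kv          # query heads per shared KV head
    out = np.empty_like(q)
    for h in range(n_q):
        kv = h // group          # which shared KV head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)
        # row-wise softmax over key positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[h] = weights @ v[kv]
    return out

# Illustrative sizes: 8 query heads sharing 2 KV heads (assumption).
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))
k = rng.normal(size=(2, 4, 16))
v = rng.normal(size=(2, 4, 16))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 4, 16): one output per query head
```

With 2 KV heads instead of 8, the KV cache is a quarter of the multi-head-attention size while keeping the full complement of query heads.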
The pre-training dataset was curated from various sources, including scientific papers, IPCC reports, and other climate-related texts. The models were trained on a cosine learning rate schedule, with batch sizes and sequence lengths adjusted for different model sizes. The training hardware used a high-performance computing cluster powered by hydropower. The models were evaluated on climate-specific and general domain benchmarks, with results showing strong performance. The paper also discusses the challenges of domain adaptation, the importance of data quality, and the need for responsible AI practices.
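The cosine learning rate schedule mentioned above can be sketched as follows; the linear warmup is the common variant for LLM pre-training, and all step counts and learning rates here are illustrative placeholders, not the paper's actual hyperparameters:

```python
import math

def cosine_lr(step, max_steps, peak_lr, min_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative values: 1000 steps, peak 3e-4, floor 3e-5, 100 warmup steps.
print(cosine_lr(0, 1000, 3e-4, 3e-5, 100))     # 0.0 at the start of warmup
print(cosine_lr(100, 1000, 3e-4, 3e-5, 100))   # peak_lr once warmup ends
print(cosine_lr(1000, 1000, 3e-4, 3e-5, 100))  # decays to min_lr at the end
```

The schedule rises linearly for the first `warmup_steps`, then follows half a cosine period, so the learning rate falls slowly near the peak and near the floor and fastest in between.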