25 Oct 2023 | Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang
GLM-130B is an open-source bilingual (English and Chinese) pre-trained language model with 130 billion parameters, built to demonstrate that language models at this scale can be pre-trained successfully in the open. It was trained on 400 billion tokens over 60 days on 96 NVIDIA DGX-A100 GPU nodes. On English benchmarks it outperforms GPT-3, and on Chinese benchmarks it significantly outperforms ERNIE TITAN 3.0; it also performs strongly in zero-shot and few-shot settings. The model can be quantized to INT4 precision without post-training, allowing inference on affordable GPUs such as 4×RTX 3090 or 8×RTX 2080 Ti. The code, training logs, and related toolkit are all open-sourced.
Training required overcoming technical challenges such as loss spikes and divergence, and design choices including bidirectional attention and multi-task instruction pre-training contributed to the model's performance. Its open-source nature enables further research and development on large language models.
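To make the INT4 claim concrete, here is a minimal sketch of symmetric absmax weight quantization to 4-bit integers, the general technique behind quantizing weights without post-training. The function names and the per-row scaling granularity are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Quantize a float weight matrix row-wise to signed 4-bit integers.

    Each output row gets one scale: the row's largest magnitude maps to 7,
    so every quantized value fits the signed 4-bit range [-8, 7].
    (Illustrative sketch; GLM-130B's actual scheme may differ.)
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from INT4 values and scales."""
    return q.astype(np.float32) * scale

# Round-trip a small random weight matrix to see the quantization error.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # error is bounded by half a quantization step
```

Storing 4-bit weights instead of 16-bit halves-of-halves the memory footprint, which is what lets a 130B-parameter model fit on consumer GPUs like 4×RTX 3090.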