DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

5 Jan 2024 | Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, Huazuo Gao, Kaige Gao, Wenjun Gao, Ruiqi Ge, Kang Guan, Daya Guo, Jianzhong Guo, Guangbo Hao, Zhewen Hao, Ying He, Wenjie Hu, Panpan Huang, Erhang Li, Guowei Li, Jiashi Li, Yao Li, Y.K. Li, Wenfeng Liang, Fangyun Lin, A.X. Liu, Bo Liu, Wen Liu, Xiaodong Liu, Xin Liu, Yiyuan Liu, Haoyu Lu, Shanghao Lu, Fuli Luo, Shirong Ma, Xiaotao Nie, Tian Pei, Yishi Piao, Junjie Qiu, Hui Qu, Tongzheng Ren, Zehui Ren, Chong Ruan, Zhangli Sha, Zhihong Shao, Junxiao Song, Xuecheng Su, Jingxiang Sun, Yaofeng Sun, Minghui Tang, Bingxuan Wang, Peiyi Wang, Shiyu Wang, Yaohui Wang, Yongji Wang, Tong Wu, Y. Wu, Xin Xie, Zhenda Xie, Ziwei Xie, Yiliang Xiong, Hanwei Xu, R.X. Xu, Yanhong Xu, Dejian Yang, Yuxiang You, Shuiping Yu, Xingkai Yu, B. Zhang, Haowei Zhang, Lecong Zhang, Liyue Zhang, Mingchuan Zhang, Minghua Zhang, Wentao Zhang, Yichao Zhang, Chenggang Zhao, Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou
DeepSeek LLM is an open-source project aimed at advancing large language models (LLMs) with a long-term perspective. The project scales LLMs in two prevalent configurations, 7B and 67B parameters, pre-trained on a continuously growing dataset that currently comprises 2 trillion tokens. Supervised fine-tuning (SFT) and direct preference optimization (DPO) are then applied to the base models to produce the DeepSeek Chat models. Evaluation results show that DeepSeek LLM 67B outperforms LLaMA-2 70B across various benchmarks, especially in code, mathematics, and reasoning, while open-ended evaluations reveal that DeepSeek LLM 67B Chat performs better than GPT-3.5 in both Chinese and English tasks.

The project investigates scaling laws for hyperparameters as well as for model and data scale across different datasets. It finds that the optimal model/data scaling allocation depends on data quality: higher-quality data calls for devoting a larger share of the compute budget to model scaling. Scaling laws for hyperparameters are also derived, showing that the optimal batch size increases with the compute budget while the optimal learning rate decreases. In addition, the project introduces a new model-scale representation, non-embedding FLOPs/token (M), which leads to more accurate scaling strategies than parameter-count-based representations.
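As a rough illustration of these relations, the sketch below computes the non-embedding FLOPs/token measure M for a decoder-only Transformer and applies power-law trends for the compute-optimal batch size and learning rate. The model shape, power-law coefficients, and exponents here are illustrative assumptions, not the paper's fitted values.

```python
# Illustrative sketch of the scaling relations summarized above.
# Assumptions: the power-law coefficients/exponents are placeholders, and the
# FLOPs/token formula counts forward + backward compute for a standard
# decoder-only Transformer (dense matmuls plus attention scores).

def non_embedding_flops_per_token(n_layer: int, d_model: int, seq_len: int) -> int:
    """Model-scale representation M: non-embedding FLOPs per token.

    72 * n_layer * d_model^2 covers the dense matmuls (attention projections
    and MLP, forward and backward); 12 * n_layer * d_model * seq_len covers
    the sequence-length-dependent attention-score computation.
    """
    return 72 * n_layer * d_model**2 + 12 * n_layer * d_model * seq_len


def optimal_hyperparams(compute_budget_flops: float,
                        batch_coef: float = 0.3, batch_exp: float = 0.33,
                        lr_coef: float = 0.3, lr_exp: float = -0.125):
    """Power-law trends: batch size grows with compute, learning rate shrinks.

    Coefficients and exponents are illustrative placeholders; the paper fits
    its own constants from small-scale grid-search experiments.
    """
    batch_size = batch_coef * compute_budget_flops**batch_exp
    learning_rate = lr_coef * compute_budget_flops**lr_exp
    return batch_size, learning_rate


if __name__ == "__main__":
    # Hypothetical 7B-class shape with a 4K context window.
    M = non_embedding_flops_per_token(n_layer=30, d_model=4096, seq_len=4096)
    tokens = 2e12                 # 2 trillion training tokens
    C = M * tokens                # compute budget C = M * D
    bs, lr = optimal_hyperparams(C)
    print(f"M = {M:.3e} FLOPs/token, C = {C:.3e} FLOPs")
    print(f"suggested batch size ~ {bs:.3e}, learning rate ~ {lr:.2e}")
```

Representing model scale as FLOPs per token rather than a parameter count folds the sequence-length-dependent attention cost into the compute estimate, which is why this representation supports more accurate scaling fits than parameter-based proxies.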
DeepSeek LLM's pre-training pipeline involves data deduplication, filtering, and remixing, with a tokenizer based on Byte-level Byte-Pair Encoding (BBPE). The model architecture follows LLaMA, with adjustments for efficiency and performance. Training uses a multi-step learning rate scheduler and runs on large-scale infrastructure with techniques such as flash attention and ZeRO-1.

The alignment process includes supervised fine-tuning and DPO to improve conversational performance. Evaluation results show that DeepSeek LLM 67B outperforms other open models across public benchmarks and in open-ended evaluations, and safety evaluations confirm that it provides harmless responses. The project also addresses the challenges of held-out evaluation, demonstrating the model's performance on unseen coding, math, and instruction-following tasks. Overall, DeepSeek LLM demonstrates strong performance across multiple domains, highlighting the importance of scaling laws and data quality in LLM development.
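The multi-step learning rate scheduler mentioned in the pre-training description can be illustrated with a short sketch: linear warmup, a long constant stage, then discrete step-downs late in training. The warmup length, stage boundaries, decay factors, and peak learning rate below are illustrative assumptions rather than the exact training configuration.

```python
# Minimal sketch of a multi-step learning rate schedule: linear warmup,
# a long constant stage, then discrete step-downs near the end of training.
# Stage boundaries (80% / 90% of total steps), decay factors (0.316 / 0.1),
# warmup length, and peak LR are assumptions chosen for illustration.

def multi_step_lr(step: int, total_steps: int, max_lr: float,
                  warmup_steps: int = 2000,
                  boundaries=(0.8, 0.9), factors=(0.316, 0.1)) -> float:
    """Return the learning rate for a given training step."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_steps
    progress = step / total_steps
    if progress < boundaries[0]:
        return max_lr                # constant stage
    if progress < boundaries[1]:
        return max_lr * factors[0]   # first step-down
    return max_lr * factors[1]       # final step-down


if __name__ == "__main__":
    total, peak = 100_000, 4.2e-4    # hypothetical step count and peak LR
    for s in (0, 1_000, 2_000, 50_000, 85_000, 95_000):
        print(f"step {s:>7}: lr = {multi_step_lr(s, total, peak):.2e}")
```

One practical appeal of a stepwise schedule over cosine decay is that the long constant-rate stage can be resumed directly when continuing pre-training on additional data, which suits a continuously expanding corpus.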