DeepSeek LLM is an open-source project aimed at advancing large language models (LLMs) with a long-term perspective. It focuses on scaling LLMs in two widely used configurations, 7B and 67B parameters. The base models are pre-trained on a 2-trillion-token dataset that continues to grow; supervised fine-tuning (SFT) and direct preference optimization (DPO) are then applied to produce the DeepSeek Chat models. Evaluation results show that DeepSeek LLM 67B outperforms LLaMA-2 70B across a range of benchmarks, especially in code, mathematics, and reasoning, and open-ended evaluations show that DeepSeek LLM 67B Chat outperforms GPT-3.5 in both Chinese and English tasks.
The project investigates scaling laws for hyperparameters, for model and data scale, and across different datasets. It finds that the optimal model/data scaling allocation depends on data quality: with higher-quality data, more of the compute budget should go to model scaling rather than data scaling. The derived hyperparameter scaling laws show that the optimal batch size grows with the compute budget while the optimal learning rate shrinks. The project also introduces a new model-scale representation, non-embedding FLOPs per token (M), which yields more accurate scaling strategies than parameter counts.
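As a rough illustration of how such power-law fits are used, the sketch below derives hyperparameter suggestions from a compute budget C = M · D (FLOPs per token times training tokens). The exponents and coefficients are placeholders rather than the paper's fitted values, and the constant in `non_embedding_flops_per_token` is a standard transformer estimate, assumed here for illustration.

```python
# Illustrative sketch of compute-budget-driven scaling, assuming power-law fits
# of the general form described above. Coefficients/exponents are placeholders,
# NOT the paper's fitted values.

def non_embedding_flops_per_token(n_layer: int, d_model: int, seq_len: int) -> float:
    """Approximate non-embedding FLOPs/token M for a dense transformer.

    72 * n_layer * d_model^2 covers the matrix multiplies in the attention and
    MLP blocks (forward + backward); 12 * n_layer * d_model * seq_len covers the
    attention-score computation. This is a common estimate, assumed here.
    """
    return 72 * n_layer * d_model**2 + 12 * n_layer * d_model * seq_len


def optimal_hyperparams(compute_budget: float) -> tuple[float, float]:
    """Placeholder power laws: batch size grows and learning rate shrinks
    as the compute budget C (= M * D, in FLOPs) increases."""
    a_lr, b_lr = 0.3, -0.125   # assumed, not the paper's fit
    a_bs, b_bs = 0.3, 0.33     # assumed, not the paper's fit
    lr = a_lr * compute_budget**b_lr
    batch_size = a_bs * compute_budget**b_bs
    return lr, batch_size


if __name__ == "__main__":
    M = non_embedding_flops_per_token(n_layer=30, d_model=4096, seq_len=4096)
    D = 2e12                   # training tokens
    C = M * D                  # total training compute in FLOPs
    lr, bs = optimal_hyperparams(C)
    print(f"M={M:.3e} FLOPs/token, C={C:.3e} FLOPs, lr~{lr:.2e}, batch~{bs:.2e}")
```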
DeepSeek LLM's pre-training data is prepared through deduplication, filtering, and remixing, and tokenized with a Byte-level Byte-Pair Encoding (BBPE) tokenizer. The model architecture largely follows LLaMA, with adjustments for efficiency and performance. Training uses a multi-step learning rate scheduler with tuned hyperparameters and runs on large-scale infrastructure with techniques such as FlashAttention and ZeRO-1.
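A minimal sketch of a multi-step schedule of the kind described: a linear warmup to the peak learning rate, which is then cut at fixed fractions of the run. The warmup length, milestone fractions, decay factors, and peak learning rate below are illustrative assumptions, not the exact training values.

```python
def multi_step_lr(step: int, total_steps: int, max_lr: float,
                  warmup_steps: int = 2000,
                  milestones=(0.8, 0.9),       # assumed fractions of training
                  factors=(1.0, 0.316, 0.1)):  # assumed per-stage LR multipliers
    """Multi-step schedule: linear warmup, then a constant LR that drops to a
    fraction of max_lr once training passes each milestone (share of total steps)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = step / total_steps
    stage = sum(progress >= m for m in milestones)  # 0, 1, or 2
    return max_lr * factors[stage]


# Example: LR at a few points of a 100k-step run (3e-4 peak LR is assumed).
for s in (1_000, 50_000, 85_000, 95_000):
    print(s, round(multi_step_lr(s, 100_000, 3e-4), 6))
```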
The alignment process applies supervised fine-tuning and DPO to improve conversational performance. Evaluation results show that DeepSeek LLM 67B outperforms comparable models across public benchmarks and open-ended evaluations, and safety evaluations confirm that DeepSeek LLM 67B Chat provides harmless responses. The project also discusses the challenges of held-out evaluations, reporting the model's performance on coding, math, and instruction-following tasks. Overall, DeepSeek LLM demonstrates strong performance across multiple domains and highlights the importance of scaling laws and data quality in LLM development.
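For reference, DPO trains the policy directly on preference pairs without a separate reward model. Below is a minimal PyTorch-style sketch of the standard DPO loss (Rafailov et al.); the tensor names, batch size, and beta value are illustrative, and this is not DeepSeek's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss on sequence log-probabilities of chosen/rejected responses.

    Each input is the summed log-probability of a response under the policy being
    trained or the frozen reference (SFT) model; beta controls how far the policy
    may drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit rewards of chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy usage with random log-probabilities for a batch of 4 preference pairs.
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps).item())
```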