11 Mar 2024 | Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, Chong Ruan
DeepSeek-VL is an open-source Vision-Language (VL) model designed for real-world applications, focusing on diverse and scalable data construction, efficient model architecture, and effective training strategies. The model aims to improve user experience in practical scenarios by ensuring comprehensive representation of real-world contexts, including web screenshots, PDFs, OCR, charts, and knowledge-based content. The architecture incorporates a hybrid vision encoder that processes high-resolution images while keeping computational overhead low. The training strategy emphasizes the preservation of language capabilities during pretraining by integrating LLM training from the beginning and carefully managing the balance between vision and language modalities. DeepSeek-VL achieves state-of-the-art or competitive performance across various vision-language benchmarks while maintaining robust language-centric performance. The model is publicly available in two versions, 1.3B and 7B, to foster further innovation and applications.
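To make the "hybrid vision encoder" idea concrete, below is a minimal, illustrative PyTorch sketch of one way such an encoder can be organized: a low-resolution branch for global semantics, a high-resolution branch for fine detail, and a projector that fuses both into visual tokens for the language model. The backbone choices, image sizes, embedding dimensions, and token counts here are assumptions for illustration only, not the paper's exact design.

```python
import torch
import torch.nn as nn


class HybridVisionEncoder(nn.Module):
    """Illustrative hybrid encoder: a low-resolution branch for global
    semantics plus a high-resolution branch for fine detail, fused into
    a fixed number of visual tokens for the language model.
    All sizes below are hypothetical placeholders."""

    def __init__(self, low_res_dim=1024, high_res_dim=768, llm_dim=4096):
        super().__init__()
        # Stand-ins for the two vision backbones (e.g. a ViT-style
        # semantic encoder and a high-resolution detail encoder).
        self.low_res_backbone = nn.Conv2d(3, low_res_dim, kernel_size=16, stride=16)
        self.high_res_backbone = nn.Conv2d(3, high_res_dim, kernel_size=64, stride=64)
        # Project the concatenated features into the LLM embedding space.
        self.projector = nn.Linear(low_res_dim + high_res_dim, llm_dim)

    def forward(self, low_res_img, high_res_img):
        # low_res_img: (B, 3, 384, 384) -> 24x24 = 576 patch tokens
        # high_res_img: (B, 3, 1024, 1024) -> 16x16 = 256 patch tokens
        low = self.low_res_backbone(low_res_img).flatten(2).transpose(1, 2)
        high = self.high_res_backbone(high_res_img).flatten(2).transpose(1, 2)
        # Align token counts before fusing (simple 1D interpolation here).
        high = nn.functional.interpolate(
            high.transpose(1, 2), size=low.shape[1], mode="linear"
        ).transpose(1, 2)
        fused = torch.cat([low, high], dim=-1)
        return self.projector(fused)  # (B, 576, llm_dim) visual tokens


# Usage sketch: the resulting token sequence would be prepended or
# interleaved with text embeddings before being fed to the LLM.
encoder = HybridVisionEncoder()
tokens = encoder(torch.randn(1, 3, 384, 384), torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096])
```

The point of the two-branch layout is that the high-resolution branch can capture small text and chart details (relevant to OCR, PDFs, and screenshots) without forcing the semantic branch, and hence the token budget handed to the LLM, to grow with image resolution.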