DeepSeek-VL: Towards Real-World Vision-Language Understanding

11 Mar 2024 | Haoyu Lu*, Wen Liu*, Bo Zhang*, Bingxuan Wang†, Kai Dong†, Bo Liu†, Jingxiang Sun†, Tongzheng Ren†, Zhuoshu Li†, Hao Yang†, Yaofeng Sun†, Chengqi Deng†, Hanwei Xu†, Zhenda Xie†, Chong Ruan†
DeepSeek-VL is an open-source Vision-Language (VL) model designed for real-world vision and language understanding. The model is built around three key dimensions: data construction, model architecture, and training strategy. In terms of data construction, the model uses a diverse and extensive dataset that includes web screenshots, PDFs, OCR, charts, and knowledge-based content. A use-case taxonomy is created from real user scenarios, and an instruction-tuning dataset is constructed accordingly; this dataset significantly improves the model's user experience in practical applications. The model architecture incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024×1024) within a fixed token budget while maintaining relatively low computational overhead. This design allows the model to capture both critical semantic and fine-grained detail information across various visual tasks. The training strategy emphasizes the importance of strong language abilities in a Vision-Language Model. To preserve LLM capabilities during pretraining, an effective VL pretraining strategy is investigated by integrating LLM training from the beginning and carefully managing the competitive dynamics between the vision and language modalities: training starts with a focus on text and gradually adjusts the modality ratio toward a balanced integration of both. The DeepSeek-VL family (1.3B and 7B models) demonstrates superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of vision-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.
Both 1.3B and 7B models are publicly accessible to foster innovations based on this foundation model.
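The hybrid-encoder description above is qualitative; the sketch below illustrates how two branches (a low-resolution semantic view and a 1024×1024 detail view) could be fused into a fixed visual-token budget. The branch stubs, feature dimensions, token count, and fusion scheme are illustrative assumptions, not the released DeepSeek-VL architecture.

```python
import torch
import torch.nn as nn


class HybridVisionEncoderSketch(nn.Module):
    """Minimal sketch of a hybrid vision encoder with a fixed token budget.

    Two branches are assumed: a low-resolution branch for global semantics and a
    high-resolution (1024x1024) branch for fine detail. Their feature maps are
    aligned on a common grid, concatenated, and projected so that every image is
    represented by the same number of visual tokens regardless of content.
    """

    def __init__(self, dim: int = 1024, grid: int = 24):
        super().__init__()
        # Stand-ins for the real branches (e.g. a semantic ViT and a high-res
        # detail encoder); patchify convolutions keep the sketch self-contained.
        self.semantic_branch = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 384 -> 24x24
        self.detail_branch = nn.Conv2d(3, dim, kernel_size=32, stride=32)    # 1024 -> 32x32
        self.align = nn.AdaptiveAvgPool2d(grid)   # bring both branches to a 24x24 grid
        self.proj = nn.Linear(2 * dim, dim)       # fuse the two branches per token
        self.num_tokens = grid * grid             # fixed budget: 24 * 24 = 576 tokens

    def forward(self, low_res: torch.Tensor, high_res: torch.Tensor) -> torch.Tensor:
        sem = self.align(self.semantic_branch(low_res))   # (B, dim, 24, 24)
        det = self.align(self.detail_branch(high_res))    # (B, dim, 24, 24)
        fused = torch.cat([sem, det], dim=1)               # (B, 2*dim, 24, 24)
        tokens = fused.flatten(2).transpose(1, 2)          # (B, 576, 2*dim)
        return self.proj(tokens)                           # (B, 576, dim)


# Usage: the token count stays fixed whether the input is a chart, a PDF page, or a photo.
# encoder = HybridVisionEncoderSketch()
# tokens = encoder(torch.rand(1, 3, 384, 384), torch.rand(1, 3, 1024, 1024))
```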
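Similarly, the "text first, then gradually rebalance" training strategy can be made concrete with a simple mixing-ratio schedule. The fractions, the linear ramp, and the `vl_loader`/`text_loader` names below are hypothetical; they only illustrate the warm-up idea, not the published training recipe.

```python
import random


def vl_fraction(step: int, total_steps: int,
                start: float = 0.1, end: float = 0.7) -> float:
    """Fraction of vision-language samples to draw at a given training step.

    Hypothetical linear warm-up: training begins dominated by text-only data
    (preserving language ability) and gradually shifts toward a more balanced
    vision-language mix. The fractions and the linear ramp are assumptions.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    return start + progress * (end - start)


def sample_batch(step: int, total_steps: int, vl_loader, text_loader):
    """Pick the next batch from one of two (hypothetical) data iterators."""
    if random.random() < vl_fraction(step, total_steps):
        return next(vl_loader)   # multimodal batch
    return next(text_loader)     # text-only batch
```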