17 Jun 2024 | Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
The paper "Unveiling Encoder-Free Vision-Language Models" by Haiwen Diao, Yufeng Cui, Xiaotong Li, Yuez Wang, Huchuan Lu, and Xinlong Wang introduces EVE, an encoder-free vision-language model designed to address the limitations of existing encoder-based models. Traditional vision-language models (VLMs) rely on vision encoders to extract visual features, which can impose strong inductive biases and hinder flexibility and efficiency. The authors propose a training recipe for pure VLMs that do not use vision encoders, aiming to bridge the gap between encoder-based and encoder-free models.
Key contributions of the paper include:
1. **Efficient Training Strategy**: The authors develop a training strategy that integrates vision and language representations within a unified decoder, enhancing visual recognition capabilities through extra supervision.
2. **Model Architecture**: EVE is built on top of the Vicuna-7B language model, using a lightweight patch embedding layer to encode raw image inputs directly and a patch aligning layer that supplies additional visual supervision (a minimal sketch follows this list).
3. **Training Procedure**: Training proceeds in three stages: LLM-guided pre-training, generative pre-training, and supervised fine-tuning. Each stage is designed to stabilize the training process and progressively improve model performance (see the staged-schedule sketch after this list).
4. **Performance and Evaluation**: EVE demonstrates superior performance on multiple vision-language benchmarks compared to encoder-free models like Fuyu-8B, using only 35M publicly accessible data samples. It also outperforms encoder-based models like InternVL-Chat and mPLUG-Owl2.
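To make the encoder-free idea concrete, below is a minimal sketch (not the authors' released code) of how raw pixels can be turned into tokens for a unified decoder: a single convolution serves as the patch embedding layer, and the resulting visual tokens are concatenated with text embeddings before entering the LLM. The class name, patch size, and the 4096-dimensional hidden width (Vicuna-7B's embedding size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Project raw pixels into the LLM embedding space, with no vision encoder."""

    def __init__(self, hidden_size: int = 4096, patch_size: int = 14, in_channels: int = 3):
        super().__init__()
        # One convolution turns each non-overlapping patch into a "visual token".
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> (B, hidden, H/ps, W/ps) -> (B, num_patches, hidden)
        x = self.proj(images)
        return x.flatten(2).transpose(1, 2)

# Visual tokens are simply concatenated with the text token embeddings, and the
# unified decoder (e.g. a Vicuna-7B backbone) attends over the joint sequence.
patch_embed = PatchEmbedding()
images = torch.randn(2, 3, 224, 224)    # dummy image batch
text_embeds = torch.randn(2, 32, 4096)  # placeholder text embeddings
inputs_embeds = torch.cat([patch_embed(images), text_embeds], dim=1)
print(inputs_embeds.shape)              # torch.Size([2, 288, 4096])
```

Because patches feed straight into the decoder, input resolution and aspect ratio are constrained only by sequence length rather than by a pretrained encoder's fixed input size, which is the flexibility benefit noted below.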
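The three-stage procedure can be pictured as a freeze/unfreeze plan over the model's components. The module names and the exact per-stage split of trainable parameters below are assumptions for illustration, not the paper's precise recipe.

```python
import torch.nn as nn

# Illustrative stage definitions; which modules train in each stage is assumed.
TRAINING_STAGES = [
    {   # Stage 1: LLM-guided pre-training -- keep the language model frozen
        # and warm up the vision side so visual tokens land in its space.
        "name": "llm_guided_pretraining",
        "trainable": ["patch_embedding", "patch_aligning"],
    },
    {   # Stage 2: generative pre-training -- train the whole model with
        # next-token prediction plus the extra visual supervision.
        "name": "generative_pretraining",
        "trainable": ["patch_embedding", "patch_aligning", "llm_decoder"],
    },
    {   # Stage 3: supervised fine-tuning on instruction-following data.
        "name": "supervised_finetuning",
        "trainable": ["patch_embedding", "patch_aligning", "llm_decoder"],
    },
]

def apply_stage(model: nn.Module, stage: dict) -> None:
    """Freeze every parameter except those in the stage's trainable modules."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in stage["trainable"])
```

Staging the warm-up this way is plausibly what the summary means by "stabilizing the training process": an initially frozen LLM keeps early, noisy visual gradients from degrading its language ability.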
The paper highlights the benefits of EVE, including efficient deployment, reduced latency, and the ability to handle high-resolution images with arbitrary aspect ratios. However, it also acknowledges limitations, such as the performance gap with state-of-the-art encoder-based models and the need for more training data to fully match their capabilities. The authors suggest future directions, including scaling up LLM capacity and exploring multi-modal integration.