Unveiling Encoder-Free Vision-Language Models

17 Jun 2024 | Haiwen Diao¹,²*, Yufeng Cui²*, Xiaotong Li³,², Yueze Wang², Huchuan Lu¹†, Xinlong Wang²†
This paper introduces EVE, an encoder-free vision-language model (VLM) that removes the vision encoder entirely while achieving performance comparable to encoder-based models. Traditional VLMs rely on pre-trained vision encoders to extract visual features, which can restrict input resolution and aspect ratio and limit deployment efficiency. EVE instead trains a unified decoder-only architecture that directly processes both vision and language inputs. Through extensive experiments, the authors identify two key strategies for training encoder-free VLMs: (1) unifying vision-language representation within a single decoder, and (2) enhancing visual recognition via additional supervision. These strategies let EVE train efficiently and handle high-resolution images of arbitrary aspect ratios.

EVE performs strongly across multiple vision-language benchmarks, clearly outperforming Fuyu-8B, its encoder-free counterpart, while using only 35M publicly available training samples. Its architecture is designed to be efficient and transparent, offering a practical path toward encoder-free VLMs. The paper also discusses the main challenges of training encoder-free models, in particular achieving effective vision-language representation alignment and providing sufficient visual supervision.
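To make the unified decoder-only design concrete, here is a minimal sketch of how an encoder-free VLM can turn an image of arbitrary resolution and aspect ratio into patch embeddings and model it jointly with text tokens in one causal decoder. This is an illustrative reconstruction, not EVE's released code; names such as `PatchEmbed`, `EncoderFreeVLM`, and `patch_size` are assumptions.

```python
# Minimal sketch of an encoder-free VLM forward pass (illustrative, not EVE's release code).
# Assumption: visual input goes through a lightweight patch-embedding layer only;
# all cross-modal fusion happens inside a single causal decoder.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Turns an image of arbitrary resolution/aspect ratio into a sequence of patch tokens."""

    def __init__(self, patch_size: int = 14, dim: int = 4096):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); H and W need not be equal or fixed,
        # they only need to be at least patch_size (extra pixels are truncated).
        x = self.proj(image)                 # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


class EncoderFreeVLM(nn.Module):
    def __init__(self, decoder: nn.Module, embed_tokens: nn.Embedding, dim: int = 4096):
        super().__init__()
        self.patch_embed = PatchEmbed(dim=dim)
        self.embed_tokens = embed_tokens     # the LLM's token embedding table
        self.decoder = decoder               # any causal transformer that accepts (B, N, dim) embeddings

    def forward(self, image: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.patch_embed(image)      # (B, Nv, dim)
        text_tokens = self.embed_tokens(input_ids)   # (B, Nt, dim)
        # Vision and language live in one sequence; the decoder models both jointly.
        inputs = torch.cat([vision_tokens, text_tokens], dim=1)
        return self.decoder(inputs)
```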
To strengthen visual perception and accelerate convergence, the authors propose a hierarchical aggregation strategy that integrates intermediate features from multiple decoder layers. EVE is trained with a three-stage process: (1) LLM-guided pre-training to align vision and language modalities, (2) generative pre-training to build up vision-language understanding, and (3) supervised fine-tuning to strengthen instruction-following and language capabilities. The model is evaluated on academic, open-world, and scientific benchmarks, where it shows strong performance, and ablation studies validate the effectiveness of the chosen configurations and training strategies. Overall, EVE offers a promising, data-efficient path to encoder-free VLMs, with benefits for multi-modal input processing and deployment efficiency. Minimal sketches of the aggregation strategy and the staged training schedule follow.
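As a rough illustration of the hierarchical aggregation idea, the sketch below gathers the decoder's intermediate hidden states at the visual token positions, combines them with learnable layer weights, and aligns the result with features from a frozen vision encoder as the extra supervision signal. The specific mechanism (softmax layer weights, cosine alignment loss, names like `HierarchicalVisualSupervision`) is an assumption for illustration, not the paper's exact implementation.

```python
# Rough sketch: (a) aggregate intermediate decoder features at visual token positions,
# (b) align them with a frozen vision encoder's patch features as extra supervision.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalVisualSupervision(nn.Module):
    def __init__(self, num_layers: int, dim: int, teacher_dim: int):
        super().__init__()
        # One learnable weight per aggregated decoder layer.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(dim, teacher_dim)

    def forward(self, hidden_states: list, vision_slice: slice,
                teacher_features: torch.Tensor) -> torch.Tensor:
        # hidden_states: per-layer tensors of shape (B, N, dim) from the decoder.
        # vision_slice: positions of the visual tokens inside the joint sequence.
        # teacher_features: (B, Nv, teacher_dim) from a frozen vision encoder
        # (assumed to yield one feature per visual token).
        weights = torch.softmax(self.layer_weights, dim=0)
        stacked = torch.stack([h[:, vision_slice, :] for h in hidden_states])  # (L, B, Nv, dim)
        aggregated = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)          # (B, Nv, dim)
        student = self.proj(aggregated)
        # Cosine alignment loss against the teacher's patch features.
        return 1.0 - F.cosine_similarity(student, teacher_features, dim=-1).mean()
```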
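The three-stage recipe can likewise be written down as a simple schedule of which components are trainable at each stage. The module grouping below is hypothetical and only meant to convey the staged training described above.

```python
# Illustrative three-stage training schedule (the module grouping is an assumption,
# not the paper's exact recipe).
TRAINING_STAGES = [
    {   # Stage 1: LLM-guided pre-training -- adapt the new visual inputs to a frozen LLM.
        "name": "llm_guided_pretraining",
        "trainable": ["patch_embed", "visual_supervision"],
        "frozen": ["decoder", "embed_tokens"],
    },
    {   # Stage 2: generative pre-training -- train the whole model on image-text data.
        "name": "generative_pretraining",
        "trainable": ["patch_embed", "visual_supervision", "decoder", "embed_tokens"],
        "frozen": [],
    },
    {   # Stage 3: supervised fine-tuning -- instruction data to sharpen language ability.
        "name": "supervised_finetuning",
        "trainable": ["patch_embed", "visual_supervision", "decoder", "embed_tokens"],
        "frozen": [],
    },
]
```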