2024 | Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang, Size Li, Di Zhang
EVLM is an efficient vision-language model designed to enhance visual understanding while keeping computational cost low. Rather than feeding visual tokens directly into the language model, it uses Flamingo-style gated cross-attention for interaction between visual and textual inputs, draws on hierarchical ViT features, and incorporates a Mixture of Experts (MoE) mechanism to improve performance. The architecture consists of a visual encoder, a large language model, and gated cross-attention layers: a 4.4B EVA2-CLIP-E-Plus model extracts hierarchical visual features, a set of learnable tokens represents those features compactly, and the language model is conditioned on them through the gated cross-attention layers. Training proceeds in three stages: multi-modal pre-training, multi-task continual pre-training, and multi-modal instruction fine-tuning. This design balances computational efficiency with information richness, yields significant gains in training efficiency, and achieves competitive results across public multi-modal benchmarks for visual understanding and reasoning, excelling in tasks such as image and video captioning.
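The paper itself does not ship code; the PyTorch sketch below only illustrates the general pattern of Flamingo-style gated cross-attention with learnable visual tokens described above. The class name, parameter names, and sizes (`GatedCrossAttentionBlock`, `n_visual_tokens`, etc.) are illustrative assumptions rather than EVLM's actual implementation, and the MoE and hierarchical-feature details are omitted.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Minimal sketch of Flamingo-style gated cross-attention (not EVLM's code).

    A set of learnable tokens first attends over the ViT features to form a
    compact visual representation; the text hidden states then cross-attend to
    those tokens, scaled by a tanh gate that is zero-initialized so the block
    starts as an identity mapping and the pretrained LLM is initially undisturbed.
    """

    def __init__(self, d_model: int = 1024, n_heads: int = 8, n_visual_tokens: int = 64):
        super().__init__()
        # Learnable tokens that summarize visual features (size is a guess).
        self.visual_tokens = nn.Parameter(torch.randn(1, n_visual_tokens, d_model) * 0.02)
        self.compress_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)
        # tanh(0) == 0, so visual conditioning is introduced gradually during training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, vit_features: torch.Tensor) -> torch.Tensor:
        # text_hidden: (B, T, d_model); vit_features: (B, N_patches, d_model)
        batch = vit_features.size(0)
        queries = self.visual_tokens.expand(batch, -1, -1)
        # Learnable tokens pool the ViT features into a fixed-size representation.
        vis, _ = self.compress_attn(queries, vit_features, vit_features)
        # Text attends to the pooled visual tokens; the gate scales the residual update.
        attn_out, _ = self.cross_attn(self.norm_q(text_hidden), vis, vis)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(self.norm_ffn(x))
        return x
```

The zero-initialized gates are the standard Flamingo trick for injecting visual conditioning without disrupting the frozen or pretrained language model at the start of multi-modal pre-training; using a fixed number of learnable tokens keeps the cross-attention cost independent of the number of image patches.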