EVLM: An Efficient Vision-Language Model for Visual Understanding

19 Jul 2024 | Kaibing Chen, Dong Shen, Hanwen Zhong, Huasong Zhong, Kui Xia, Di Xu, Wei Yuan, Yifei Hu, Bin Wen, Tianke Zhang, Changyi Liu, Dewen Fan, Huihui Xiao, Jiahong Wu, Fan Yang†, Size Li, Di Zhang
The paper introduces EVLM, an efficient multi-modal language model designed for visual understanding tasks. EVLM addresses the computational overhead and limited visual-signal perception of existing models, particularly those that rely on single-layer ViT features. Key contributions include (a hedged sketch of the fusion block follows this list):

1. **Cross-Attention Mechanism**: Similar to Flamingo, EVLM employs cross-attention to mediate the interaction between visual and textual inputs, reducing computational cost while preserving effective feature dimensions.
2. **Hierarchical ViT Features**: The model uses ViT features from multiple layers of the visual encoder to form a more comprehensive representation of the visual signal.
3. **Mixture of Experts (MoE)**: EVLM incorporates MoE to scale the number of trainable parameters, improving model effectiveness.
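For concreteness, here is a minimal PyTorch sketch of how a Flamingo-style gated cross-attention block could combine hierarchical ViT features with a Mixture-of-Experts feed-forward. All module names, dimensions, and the top-1 router are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: gated cross-attention over hierarchical ViT features with an
# MoE feed-forward. Dimensions and routing strategy are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Top-1 routed mixture of expert MLPs (illustrative)."""
    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 4096):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = F.softmax(self.router(x), dim=-1)   # (B, T, num_experts)
        top_gate, top_idx = gates.max(dim=-1)       # (B, T)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


class GatedCrossAttentionBlock(nn.Module):
    """Text tokens attend to a learned mix of multi-layer ViT features; tanh-gated residuals."""
    def __init__(self, dim: int = 1024, num_heads: int = 8, num_vit_layers: int = 4):
        super().__init__()
        # Learned weights to mix ViT features taken from several encoder layers.
        self.layer_weights = nn.Parameter(torch.zeros(num_vit_layers))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = MoEFeedForward(dim)
        self.attn_gate = nn.Parameter(torch.zeros(1))  # gates start closed, as in Flamingo
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, vit_feats: torch.Tensor) -> torch.Tensor:
        # vit_feats: (num_vit_layers, B, num_patches, dim) -> weighted sum over layers
        w = F.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        visual = (w * vit_feats).sum(dim=0)
        attended, _ = self.attn(query=text, key=visual, value=visual)
        text = text + torch.tanh(self.attn_gate) * attended
        text = text + torch.tanh(self.ffn_gate) * self.ffn(text)
        return text


if __name__ == "__main__":
    block = GatedCrossAttentionBlock()
    text = torch.randn(2, 16, 1024)            # (batch, text tokens, dim)
    vit_feats = torch.randn(4, 2, 256, 1024)   # (ViT layers, batch, patches, dim)
    print(block(text, vit_feats).shape)        # torch.Size([2, 16, 1024])
```

The zero-initialized tanh gates keep the pretrained language model's behavior intact at the start of training, while the learned layer weights let the block decide how much each ViT depth contributes.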
The training process consists of three stages (a schematic of the schedule appears after the concluding paragraph):

- **Multi-modal Pre-training**: Aligns images and text and models their intrinsic relationships using a large-scale dataset of image-text captions and web-type multi-modal data.
- **Multi-task Continual Pre-training**: Strengthens high-level visual question-answering abilities using a mix of VQA, NLP, OCR, and detection data.
- **Supervised Fine-tuning**: Fine-tunes the model on instruction-tuning data to improve dense captioning, with a focus on image and video captioning.

EVLM achieves strong results on public benchmarks, particularly in image and video captioning, and matches or outperforms state-of-the-art models on general VQA, text-oriented VQA, and general multimodal benchmarks. Its ability to handle complex visual and textual information makes it a robust solution for multimodal understanding and reasoning tasks.
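The staged recipe can be summarized as a simple schedule. The stage names and data types below follow the summary above; the `Stage` dataclass and `run_schedule` driver are hypothetical scaffolding for illustration, not the authors' training code.

```python
# Illustrative sketch of the three-stage training schedule summarized above.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    data: list[str]
    objective: str


SCHEDULE = [
    Stage("multi-modal pre-training",
          ["image-text captions", "web-type multi-modal data"],
          "align images and text; model intrinsic relationships"),
    Stage("multi-task continual pre-training",
          ["VQA", "NLP", "OCR", "detection"],
          "strengthen high-level visual question answering"),
    Stage("supervised fine-tuning",
          ["instruction-tuning data for image and video captioning"],
          "improve dense captioning and instruction following"),
]


def run_schedule(train_one_stage, schedule=SCHEDULE):
    """Run the stages in order; `train_one_stage` is a user-supplied callback."""
    for stage in schedule:
        print(f"[{stage.name}] data={stage.data} | objective={stage.objective}")
        train_one_stage(stage)


if __name__ == "__main__":
    run_schedule(lambda stage: None)  # dry run that only prints the schedule
```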