3 Jul 2024 | Pan Zhang*, Xiaoyi Dong*1,2, Yuhang Zang*, Yuhang Cao1, Rui Qian1,2, Lin Chen1, Qipeng Guo1, Haodong Duan1, Bin Wang1, Linke Ouyang1, Songyang Zhang1, Wenwei Zhang1, Yining Li1, Yang Gao1, Peng Sun1, Xinyue Zhang1, Wei Li1, Jingwen Li1, Wenhai Wang1,2, Hang Yan1, Conghui He3, Xingcheng Zhang3, Kai Chen1, Jifeng Dai1,1, Yu Qiao1, Dahua Lin1,2, Jiaqi Wang1,2
InternLM-XComposer-2.5 (IXC-2.5) is a versatile large vision-language model that supports long-contextual input and output. It excels in text-image comprehension and composition, achieving capabilities comparable to GPT-4V with only a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can extend to 96K long contexts via RoPE extrapolation. IXC-2.5 features three major upgrades in vision-language comprehension: ultra-high-resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. Beyond comprehension, it supports two applications through extra LoRA parameters: crafting webpages and composing high-quality text-image articles. The article-composition application is built on a scalable pipeline that integrates supervised fine-tuning, reward modeling, preference data collection, and DPO alignment. The model also accepts audio input and produces audio output via open-source tools.

IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 of them and surpassing or competing closely with GPT-4V and Gemini Pro on 16 key tasks, with strong results in video understanding, high-resolution image analysis, and webpage generation. The model is publicly available at https://github.com/InternLM/InternLM-XComposer.
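The abstract states that the model is trained on 24K contexts and extended to 96K via RoPE extrapolation but does not spell out the mechanism here. The sketch below illustrates one common form of RoPE extrapolation (NTK-aware base scaling) as an assumption; the helper names `rope_frequencies` and `apply_rope` and the scale factor are illustrative, not the paper's implementation.

```python
# Minimal sketch of rotary position embedding (RoPE) with an extrapolation
# knob. Assumed variant: NTK-aware base scaling; scale ~= 96K / 24K = 4
# would stretch a 24K-trained context toward 96K positions.
import torch

def rope_frequencies(head_dim: int, max_pos: int, base: float = 10000.0,
                     scale: float = 1.0) -> torch.Tensor:
    """Return complex rotations for positions [0, max_pos)."""
    # Enlarge the rotary base instead of shrinking position indices.
    adjusted_base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (adjusted_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float()
    angles = torch.outer(positions, inv_freq)             # (max_pos, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)   # unit-norm complex rotations

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    """Rotate query/key vectors. x: (batch, seq, heads, head_dim)."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rotated = x_complex * freqs[: x.shape[1]].unsqueeze(0).unsqueeze(2)
    return torch.view_as_real(rotated).flatten(-2).type_as(x)
```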
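The two applications (webpage crafting, article composition) are said to use extra LoRA parameters on top of the base model. A minimal sketch of a LoRA-augmented linear layer is shown below; the rank, alpha, and target modules are placeholder assumptions rather than the paper's configuration.

```python
# Minimal LoRA sketch: frozen base weight plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                              # pretrained, kept frozen
        self.base.weight.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the low-rank update (B @ A) applied to x.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```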
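The article-composition pipeline is described as supervised fine-tuning, reward modeling, preference data collection, and DPO alignment. The snippet below is a minimal sketch of the standard DPO objective for the final alignment step; the beta value and the per-sequence log-probability inputs are assumptions, as the abstract gives no hyperparameters.

```python
# Standard DPO loss over per-sequence log-probabilities of chosen vs.
# rejected responses, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen article over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```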