3 Jul 2024 | Pan Zhang*1, Xiaoyi Dong*1,2, Yuhang Zang*1, Yuhang Cao1, Rui Qian1,2, Lin Chen1, Qipeng Guo1, Haodong Duan1, Bin Wang1, Linke Ouyang1, Songyang Zhang1, Wenwei Zhang1, Yining Li1, Yang Gao1, Peng Sun1, Xinyue Zhang1, Wei Li1, Jingwen Li1, Wenhai Wang1,2, Hang Yan1, Conghui He3, Xingcheng Zhang3, Kai Chen1, Jifeng Dai1, Yu Qiao1, Dahua Lin1,2, Jiaqi Wang1
The paper introduces InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision language model designed to support long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition tasks, achieving capabilities comparable to GPT-4V with a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can extend to 96K long contexts through positional encoding extrapolation. Key upgrades in IXC-2.5 include ultra-high resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. Additionally, it supports text-image composition tasks such as crafting webpages and composing high-quality text-image articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks and surpassing or closely competing with GPT-4V and Gemini Pro on 16 key tasks. The model is publicly available at <https://github.com/InternLM/InternLM-XComposer>.
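The abstract does not spell out the extrapolation mechanism, but a common way to stretch a rotary-position-embedding (RoPE) model from its 24K training length to 96K inference length is NTK-aware base scaling: the RoPE base frequency is enlarged so rotation angles at long positions stay within the range seen during training. The sketch below is illustrative, not IXC-2.5's confirmed recipe; the per-head dimension `dim = 128` and the helper names are assumptions.

```python
import math

def rope_frequencies(dim, base=10000.0):
    # Inverse frequencies for rotary position embedding (RoPE),
    # one per pair of hidden dimensions.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def ntk_scaled_base(base, dim, train_len, target_len):
    # NTK-aware scaling (a common community technique, not necessarily
    # IXC-2.5's exact method): enlarge the RoPE base so positions up to
    # target_len reuse the angular range covered during training.
    scale = target_len / train_len  # e.g. 96K / 24K = 4
    return base * scale ** (dim / (dim - 2))

base = 10000.0
dim = 128  # per-head dimension (assumed value)
new_base = ntk_scaled_base(base, dim, train_len=24_000, target_len=96_000)

orig = rope_frequencies(dim, base)
scaled = rope_frequencies(dim, new_base)
# With the larger base, every non-trivial frequency shrinks, so rotation
# angles grow more slowly and 96K-token positions stay interpolative.
assert all(s < o for s, o in zip(scaled[1:], orig[1:]))
```

Because only the precomputed frequency table changes, this kind of extension requires no retraining of model weights, which is what makes training at 24K and evaluating at 96K feasible.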