3 Jul 2024 | Pan Zhang*1, Xiaoyi Dong*1,2, Yuhang Zang*1, Yuhang Cao1, Rui Qian1,2, Lin Chen1, Qipeng Guo1, Haodong Duan1, Bin Wang1, Linke Ouyang1, Songyang Zhang1, Wenwei Zhang1, Yining Li1, Yang Gao1, Peng Sun1, Xinyue Zhang1, Wei Li1, Jingwen Li1, Wenhai Wang1,2, Hang Yan1, Conghui He3, Xingcheng Zhang3, Kai Chen1, Jifeng Dai1, Yu Qiao1, Dahua Lin1,2, Jiaqi Wang1
The paper introduces InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision language model designed to support long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition tasks, achieving capabilities comparable to GPT-4V with a 7B LLM backend. Trained with 24K interleaved image-text contexts, it can extend to 96K long contexts through positional encoding extrapolation. Key upgrades in IXC-2.5 include ultra-high resolution understanding, fine-grained video understanding, and multi-turn multi-image dialogue. Additionally, it supports text-image composition tasks such as crafting webpages and composing high-quality text-image articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks and surpassing or closely competing with GPT-4V and Gemini Pro on 16 key tasks. The model is publicly available at <https://github.com/InternLM/InternLM-XComposer>.
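The abstract does not spell out the extrapolation mechanism, but a common way to stretch a rotary-position-embedding (RoPE) model from its 24K training length to 96K inference length is NTK-aware base scaling: the RoPE base frequency is enlarged so rotation angles at long positions stay within the range seen during training. The sketch below is illustrative, not IXC-2.5's confirmed recipe; the per-head dimension `dim = 128` and the helper names are assumptions.

```python
import math

def rope_frequencies(dim, base=10000.0):
    # Inverse frequencies for rotary position embedding (RoPE),
    # one per pair of hidden dimensions.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def ntk_scaled_base(base, dim, train_len, target_len):
    # NTK-aware scaling (a common community technique, not necessarily
    # IXC-2.5's exact method): enlarge the RoPE base so positions up to
    # target_len reuse the angular range covered during training.
    scale = target_len / train_len  # e.g. 96K / 24K = 4
    return base * scale ** (dim / (dim - 2))

base = 10000.0
dim = 128  # per-head dimension (assumed value)
new_base = ntk_scaled_base(base, dim, train_len=24_000, target_len=96_000)

orig = rope_frequencies(dim, base)
scaled = rope_frequencies(dim, new_base)
# With the larger base, every non-trivial frequency shrinks, so rotation
# angles grow more slowly and 96K-token positions stay interpolative.
assert all(s < o for s, o in zip(scaled[1:], orig[1:]))
```

Because only the precomputed frequency table changes, this kind of extension requires no retraining of model weights, which is what makes training at 24K and evaluating at 96K feasible.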