29 Jan 2024 | Xiaoyi Dong*1,2, Pan Zhang*1, Yuhang Zang*1, Yuhang Cao1,2, Bin Wang1, Linke Ouyang1, Xilin Wei1, Songyang Zhang1, Haodong Duan1, Maosong Cao1, Wenwen Zhang1, Yining Li1, Hang Yan1, Yang Gao1, Xinyue Zhang1, Wei Li1, Jingwen Li1, Kai Chen1, Conghui He3, Xingcheng Zhang3, Yu Qiao1, Dahua Lin1,2, Jiaqi Wang1
**InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models**
InternLM-XComposer2 is a cutting-edge vision-language model that excels in free-form text-image composition and comprehension. Building on the InternLM2-7B model, InternLM-XComposer2 significantly outperforms existing multimodal models and matches or surpasses advanced models like GPT-4V and Gemini Pro in various benchmarks. The model introduces a Partial LoRA (PLoRA) approach, which applies additional LoRA parameters exclusively to image tokens, preserving the integrity of pre-trained language knowledge while balancing precise vision understanding and text composition. This design ensures robust performance in both visual and textual domains. The model's training data is meticulously curated to adhere to complex instructions, support customization with text and images, and enable high-quality and diverse writing. InternLM-XComposer2 demonstrates exceptional capabilities in detailed perception, logical reasoning, and extensive knowledge integration, making it a significant advancement in the field of multimodal understanding. The model is publicly available at \url{https://github.com/InternLM/InternLM-XComposer}.
**Keywords:** Vision-Language Models, Text-Image Composition, Partial LoRA, Multimodal Understanding, GPT-4V, Gemini Pro
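To make the Partial LoRA (PLoRA) idea concrete, here is a minimal PyTorch sketch of the core mechanism the abstract describes: a frozen pre-trained linear layer whose low-rank update is added only at image-token positions. The class name, rank, and the `image_mask` argument are illustrative assumptions, not the paper's actual API; the authors' implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Sketch of Partial LoRA (PLoRA): the frozen base projection serves all
    tokens, while a trainable low-rank branch is applied only to image tokens.
    Hypothetical names and hyperparameters, for illustration only."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # keep pre-trained weights frozen
        self.base.bias.requires_grad_(False)
        # Low-rank adapter pair: update = lora_b(lora_a(x))
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # zero init: starts identical to base

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        # image_mask: (batch, seq_len) bool, True where the token is visual
        out = self.base(x)
        lora_out = self.lora_b(self.lora_a(x))
        # Gate the LoRA update so it touches image tokens exclusively.
        return out + lora_out * image_mask.unsqueeze(-1).to(out.dtype)

# Usage: first 8 of 16 tokens are image tokens, the rest are text tokens.
layer = PartialLoRALinear(4096, 4096)
x = torch.randn(2, 16, 4096)
mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, :8] = True
y = layer(x, mask)  # (2, 16, 4096)
```

Because the adapter is zero-initialized and gated by the mask, text tokens always pass through the unchanged pre-trained weights, which is consistent with the stated goal of preserving language knowledge while adapting vision understanding.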