29 Jan 2024 | Xiaoyi Dong*1,2, Pan Zhang*1, Yuhang Zang*1, Yuhang Cao1,2, Bin Wang1, Linke Ouyang1, Xilin Wei1, Songyang Zhang1, Haodong Duan1, Maosong Cao1, Wenwen Zhang1, Yining Li1, Hang Yan1, Yang Gao1, Xinyue Zhang1, Wei Li1, Jingwen Li1, Kai Chen1, Conghui He3, Xingcheng Zhang3, Yu Qiao1, Dahua Lin1,2, Jiaqi Wang1
**InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models**
InternLM-XComposer2 is a cutting-edge vision-language model that excels in free-form text-image composition and comprehension. Building on the InternLM2-7B model, InternLM-XComposer2 significantly outperforms existing multimodal models and matches or surpasses advanced models like GPT-4V and Gemini Pro in various benchmarks. The model introduces a Partial LoRA (PLoRA) approach, which applies additional LoRA parameters exclusively to image tokens, preserving the integrity of pre-trained language knowledge while balancing precise vision understanding and text composition. This design ensures robust performance in both visual and textual domains. The model's training data is meticulously curated to adhere to complex instructions, support customization with text and images, and enable high-quality and diverse writing. InternLM-XComposer2 demonstrates exceptional capabilities in detailed perception, logical reasoning, and extensive knowledge integration, making it a significant advancement in the field of multimodal understanding. The model is publicly available at \url{https://github.com/InternLM/InternLM-XComposer}.
**Keywords:** Vision-Language Models, Text-Image Composition, Partial LoRA, Multimodal Understanding, GPT-4V, Gemini Pro
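To make the Partial LoRA (PLoRA) idea concrete, here is a minimal PyTorch sketch of the core mechanism the abstract describes: a frozen pre-trained linear layer whose low-rank update is added only at image-token positions. The class name, rank, and the `image_mask` argument are illustrative assumptions, not the paper's actual API; the authors' implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Sketch of Partial LoRA (PLoRA): the frozen base projection serves all
    tokens, while a trainable low-rank branch is applied only to image tokens.
    Hypothetical names and hyperparameters, for illustration only."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # keep pre-trained weights frozen
        self.base.bias.requires_grad_(False)
        # Low-rank adapter pair: update = lora_b(lora_a(x))
        self.lora_a = nn.Linear(in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # zero init: starts identical to base

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        # image_mask: (batch, seq_len) bool, True where the token is visual
        out = self.base(x)
        lora_out = self.lora_b(self.lora_a(x))
        # Gate the LoRA update so it touches image tokens exclusively.
        return out + lora_out * image_mask.unsqueeze(-1).to(out.dtype)

# Usage: first 8 of 16 tokens are image tokens, the rest are text tokens.
layer = PartialLoRALinear(4096, 4096)
x = torch.randn(2, 16, 4096)
mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, :8] = True
y = layer(x, mask)  # (2, 16, 4096)
```

Because the adapter is zero-initialized and gated by the mask, text tokens always pass through the unchanged pre-trained weights, which is consistent with the stated goal of preserving language knowledge while adapting vision understanding.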