29 Jan 2024 | Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Qiao, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang
InternLM-XComposer2 is a vision-language model that excels in free-form text-image composition and comprehension. Based on InternLM2-7B, it significantly outperforms existing multimodal models and matches or surpasses GPT-4V and Gemini Pro in certain assessments. The model uses a Partial LoRA (PLoRA) approach, applying additional LoRA parameters only to image tokens to preserve pre-trained language knowledge while balancing precise vision understanding with text composition. It is trained on high-quality, diverse data for free-form text-image composition and multimodal understanding, enabling it to generate high-quality, integrated text-image content from various inputs. InternLM-XComposer2 demonstrates exceptional performance in various benchmarks, including MathVista, MMMU, AI2D, MME, MMBench, and others. It outperforms existing open-source models and performs on par with closed-source APIs. The model's capabilities include detailed perception, logical reasoning, and extensive knowledge integration, making it highly effective for multimodal understanding. InternLM-XComposer2 is publicly available at https://github.com/InternLM/InternLM-XComposer.
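The core idea of Partial LoRA is that the frozen base weights serve all tokens, while a low-rank update is added only at image-token positions, leaving the language model's behavior on text tokens untouched. A minimal PyTorch sketch of this mechanism (class name, rank, and the boolean `image_mask` interface are illustrative assumptions, not the paper's actual implementation):

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Sketch of a Partial LoRA (PLoRA) linear layer: the low-rank
    update is applied only where image_mask is True."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        # Frozen pre-trained projection, shared by text and image tokens.
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Trainable low-rank factors (B is zero-initialized, so the
        # layer starts out identical to the base model).
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        # image_mask: (batch, seq_len) bool, True at image-token positions
        out = self.base(x)
        delta = self.lora_B(self.lora_A(x))
        # Zero out the LoRA update at text-token positions.
        return out + delta * image_mask.unsqueeze(-1)
```

Because the mask gates the update per token, text positions always pass through the unchanged pre-trained weights, which is how PLoRA preserves language ability while the LoRA factors specialize on visual tokens.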