30 Sep 2024 | Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
The paper "HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale" addresses the challenge of enhancing the medical multimodal capabilities of multimodal large language models (MLLMs) by leveraging high-quality medical image-text pairs from PubMed. The authors refine and reformat these pairs using MLLMs (specifically GPT-4V) to create the PubMedVision dataset, which contains 1.3 million medical VQA samples. This dataset significantly improves the performance of MLLMs on medical multimodal tasks, as demonstrated by benchmark results. The paper also introduces HuatuoGPT-Vision, a 34B-parameter medical MLLM trained on PubMedVision, which outperforms other open-source models across various medical multimodal benchmarks. The contributions include an unblinded data reformatting method, the creation of PubMedVision, and the development of HuatuoGPT-Vision. The study highlights the importance of high-quality medical visual knowledge in advancing medical applications of MLLMs.
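The reformatting step described above, feeding each PubMed image together with its caption to an MLLM and asking it to emit VQA pairs, can be sketched roughly as follows. This is a minimal illustration only: the function name, prompt wording, and model identifier are assumptions, not the paper's exact pipeline.

```python
# Hypothetical sketch of assembling an MLLM request that rewrites a
# PubMed image-caption pair into medical VQA samples. The prompt text
# and "gpt-4-vision-preview" model name are illustrative placeholders.

def build_vqa_reformat_request(image_url: str, caption: str, context: str = "") -> dict:
    """Assemble a chat-style multimodal request asking an MLLM to
    rewrite an image-caption pair as question-answer (VQA) samples."""
    instruction = (
        "You are shown a medical image with its caption and surrounding text. "
        "Write several question-answer pairs a clinician might ask about the "
        "image, grounded only in the provided context.\n\n"
        f"Caption: {caption}\n"
        f"Context: {context}"
    )
    return {
        "model": "gpt-4-vision-preview",  # placeholder model identifier
        "messages": [
            {
                "role": "user",
                # Multimodal content: one text part and one image part.
                "content": [
                    {"type": "text", "text": instruction},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }

request = build_vqa_reformat_request(
    image_url="https://example.org/figure1.png",
    caption="Axial CT showing a hypodense lesion in the right hepatic lobe.",
)
print(request["messages"][0]["content"][0]["type"])
```

In practice the response would then be parsed into question-answer records; the key idea is that the model sees the image itself ("unblinded"), rather than rewriting from the caption text alone.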