HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

30 Sep 2024 | Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, Guangjun Yu, Xiang Wan, Benyou Wang
This paper introduces HuatuoGPT-Vision, a medical multimodal large language model (MLLM) trained on PubMedVision, a dataset of 1.3 million medical visual question-answering (VQA) samples. The dataset is built by filtering medical image-text pairs from PubMed and then using GPT-4V as an "unblinded" reformatter, one that sees the image as well as the accompanying text, to denoise the pairs and rewrite them as VQA samples.

PubMedVision substantially improves the medical multimodal capabilities of current MLLMs, as shown by gains on benchmarks such as MMMU Health & Medicine. A quality evaluation finds that the dataset outperforms existing construction methods in accuracy, relevance, completeness, and usefulness. Using PubMedVision, the authors train a 34B-parameter medical MLLM, HuatuoGPT-Vision, which achieves the strongest performance among open-source MLLMs in medical multimodal scenarios.

The study underscores the importance of high-quality, large-scale medical multimodal datasets for medical applications, and it discusses the main obstacles to building them: patient data privacy, annotation cost, and data noise, which the proposed data-engineering pipeline is designed to address. The paper also gives a detailed account of the data construction process, covering data filtering, reformatting, and evaluation. The authors conclude that PubMedVision is a valuable resource for advancing medical MLLMs and that further work is needed to improve data quality and broaden the scope of medical applications.
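For concreteness, the reformatting step can be sketched in code. The snippet below is a minimal illustration, not the paper's actual pipeline: it assumes an OpenAI-style chat API with vision support, and the prompt wording, model name, and helper functions (`encode_image`, `reformat_to_vqa`) are placeholders invented for this example.

```python
# Minimal sketch of an "unblinded" reformatter: the model sees both the image
# and its surrounding text when rewriting the pair into a VQA sample.
# Assumptions: openai>=1.0 installed, OPENAI_API_KEY set in the environment,
# and a vision-capable model (the exact model used in the paper may differ).
import base64
from openai import OpenAI

client = OpenAI()

REFORMAT_PROMPT = (
    "You are shown a medical image together with its original caption and "
    "surrounding article text. Using both the image and the text, write one "
    "self-contained question about the image and a detailed, accurate answer. "
    "Return them as:\nQ: <question>\nA: <answer>"
)

def encode_image(path: str) -> str:
    """Base64-encode a local image file for the data-URL content format."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def reformat_to_vqa(image_path: str, caption: str, context: str,
                    model: str = "gpt-4o") -> str:
    """Turn one medical image-text pair into a VQA sample via a vision model."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{REFORMAT_PROMPT}\n\nCaption: {caption}\n\nContext: {context}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example usage with a hypothetical PubMed figure:
# vqa = reformat_to_vqa("figure_1.jpg",
#                       "Chest radiograph showing bilateral infiltrates.",
#                       "The patient presented with dyspnea and fever ...")
```

Because the reformatter is conditioned on the image itself rather than the text alone, it can correct captions that do not match the figure, which is the key difference from "blinded," text-only reformatting.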