2 Jun 2024 | Pengcheng Qiu*, Chaoyi Wu*, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang and Weidi Xie
This paper presents a comprehensive approach to building multilingual medical language models (MMedLM) for healthcare. The authors introduce a large-scale multilingual medical corpus (MMedC) containing 25.5B tokens across six languages, a multilingual medical benchmark (MMedBench) for evaluating medical language models, and a series of models, including MMed-Llama 3, which achieves superior performance on both MMedBench and English benchmarks, even rivaling GPT-4.

MMedC is constructed by aggregating medical content from four sources: medical text filtered from general corpora, medical textbooks, medical websites, and existing small-scale medical corpora. MMedBench comprises 53,566 QA pairs across six languages, each accompanied by a rationale explanation generated by GPT-4.

The authors evaluate a wide range of LLMs, including closed-source models (GPT-3.5, GPT-4, Gemini-1.0 Pro) and open-source models (Mistral, InternLM 2, Llama 3), under zero-shot, PEFT, and full fine-tuning settings. Models further trained on MMedC, such as MMed-Llama 3, achieve significantly better performance on both multiple-choice question answering and rationale generation. Ablation studies analyze the contribution of each data component, and the study highlights the importance of specialized corpora for improving LLM performance in multilingual medical contexts.
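To make the corpus-filtering step concrete, here is a minimal sketch of a keyword-density filter of the kind used to mine medical text from a general web corpus. The keyword list, threshold, and helper name looks_medical are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch: heuristic keyword-density filter for pulling medical
# documents out of a general corpus. Keywords and threshold are invented
# for illustration; the paper's real filtering rules differ in detail.
MEDICAL_KEYWORDS = {
    "patient", "diagnosis", "clinical", "symptom", "therapy",
    "dose", "disease", "treatment", "prognosis",
}

def looks_medical(text: str, min_density: float = 0.02) -> bool:
    """Keep a document if medical terms make up enough of its tokens."""
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    if not tokens:
        return False
    hits = sum(t in MEDICAL_KEYWORDS for t in tokens)
    return hits / len(tokens) >= min_density

docs = [
    "The patient presented with chest pain; clinical evaluation suggested angina.",
    "Quarterly earnings beat expectations and the index closed higher.",
]
medical_subset = [d for d in docs if looks_medical(d)]
print(medical_subset)  # only the first document passes the filter
```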
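The PEFT setting in the evaluation typically means low-rank adapters such as LoRA. The sketch below wraps a Llama-style base model with LoRA via Hugging Face's peft library; the rank, alpha, and target modules are assumed values for illustration, not the configuration reported in the paper.

```python
# Hedged sketch: parameter-efficient fine-tuning (PEFT) with LoRA adapters.
# All hyperparameters here are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base = "meta-llama/Meta-Llama-3-8B"  # stand-in for the Llama 3 base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # low-rank adapter dimension (assumed)
    lora_alpha=32,                        # adapter scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```

Freezing the base model and training only the adapters is what makes PEFT a cheap middle ground between zero-shot use and full fine-tuning.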
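Rationale generation is commonly scored against the GPT-4 reference explanations with overlap metrics such as BLEU and ROUGE. The sketch below shows one way to compute both with the sacrebleu and rouge_score packages; the example strings are invented, and the paper's exact metric setup may differ.

```python
# Hedged sketch: scoring a generated rationale against a reference
# explanation. Example texts are invented for illustration.
import sacrebleu
from rouge_score import rouge_scorer

reference = "Beta-blockers reduce myocardial oxygen demand, so they are indicated here."
hypothesis = "Beta-blockers lower heart rate and oxygen demand, making them the right choice."

bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]])
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, hypothesis)

print(f"BLEU: {bleu.score:.1f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, "
      f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```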
The authors discuss the potential impact of their work on general medical AI development, retrieval-augmented generation, and clinical applications, and identify limitations such as potential bias in the data and the need for further research on explainability and broader multilingual coverage. They conclude that their work provides a valuable resource for developing multilingual medical LLMs and encourage future research in this area.