2 Jun 2024 | Pengcheng Qiu*,1,2, Chaoyi Wu*,1,2, Xiaoman Zhang1,2, Weixiong Lin1,2, Haicheng Wang1, Ya Zhang1,2, Yanfeng Wang1,2,† and Weidi Xie1,2,†
The paper presents a comprehensive approach to developing multilingual medical language models, aiming to benefit a linguistically diverse audience. The key contributions include:
1. **Construction of MMedC**: A large-scale multilingual medical corpus containing approximately 25.5 billion tokens across six main languages (English, Chinese, Japanese, French, Russian, and Spanish). This corpus is designed for auto-regressive domain adaptation of general LLMs.
2. **MMedBench Benchmark**: A new multilingual medical question-answering (QA) benchmark with rationales, enabling evaluation of both multi-choice accuracy and rationale generation under zero-shot and fine-tuning settings.
3. **Model Evaluation**: Assessment of various open-source LLMs on MMedBench, including those trained on MMedC. The final model, MMed-Llama 3, with only 8 billion parameters, outperforms other models on both MMedBench and English benchmarks, even rivaling GPT-4.
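The "auto-regressive domain adaptation" in contribution 1 is standard next-token-prediction training continued on the medical corpus. As an illustrative sketch only (not the paper's code), the objective being minimized is the average cross-entropy of each next token given its prefix; the function below computes it from raw logits in plain Python:

```python
import math

def causal_lm_loss(logits, tokens):
    """Next-token cross-entropy: the logits at position t are used to
    predict token t+1 (auto-regressive language-modeling objective)."""
    loss, count = 0.0, 0
    for t in range(len(tokens) - 1):
        row = logits[t]
        # Numerically stable log-sum-exp for the softmax normalizer.
        z = max(row)
        log_norm = z + math.log(sum(math.exp(x - z) for x in row))
        loss += log_norm - row[tokens[t + 1]]
        count += 1
    return loss / count

# Sanity check: uniform logits over a 4-token vocabulary give loss ln(4).
logits = [[0.0] * 4 for _ in range(3)]
tokens = [1, 2, 3]
```

In practice this loss is computed by the model's framework over batches of MMedC text; the sketch only makes the training signal explicit.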
The paper highlights the importance of specialized medical corpora and the effectiveness of auto-regressive training on MMedC. The evaluation results show that MMed-Llama 3 achieves superior performance, particularly in multi-choice accuracy and rationale generation. The study also discusses the impact of different data sources and the benefits of incorporating rationale data during fine-tuning. Additionally, the paper explores the potential of multilingual medical LLMs in promoting general medical AI development, improving retrieval-augmented generation, and addressing clinical needs such as language barriers and cultural sensitivities. Limitations and future directions are also discussed, including the need for bias control, explainability, and expanding the language coverage.
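The multi-choice accuracy reported on MMedBench reduces to a simple protocol: the model ranks the candidate answers for each question and its top choice is compared to the gold answer. The sketch below is illustrative only; `overlap_score` is a hypothetical stand-in for a real scorer (e.g. the model's log-likelihood of each option):

```python
def pick_answer(question, options, score):
    """Return the candidate answer the scorer ranks highest."""
    return max(options, key=lambda opt: score(question, opt))

def multi_choice_accuracy(examples, score):
    """Fraction of (question, options, gold) triples answered correctly."""
    correct = sum(pick_answer(q, opts, score) == gold
                  for q, opts, gold in examples)
    return correct / len(examples)

# Toy scorer for illustration only: counts words shared with the
# question; a real evaluation would score options with the LLM itself.
def overlap_score(question, option):
    return len(set(question.split()) & set(option.split()))

examples = [
    ("which virus causes seasonal influenza",
     ["influenza virus", "tibia fracture"], "influenza virus"),
    ("which bone is in the lower leg",
     ["influenza virus", "the tibia bone"], "the tibia bone"),
]
```

The same harness applies unchanged across MMedBench's six languages, which is what makes per-language accuracy directly comparable.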