**BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains**
This paper introduces BioMistral, an open-source Large Language Model (LLM) tailored for the biomedical domain, built on the Mistral foundation model and further pre-trained on PubMed Central. The authors conduct a comprehensive evaluation of BioMistral on a benchmark of 10 established medical question-answering (QA) tasks in English, demonstrating superior performance over existing open-source medical models and a competitive edge against proprietary counterparts. They also explore lightweight models obtained through quantization and model merging. Additionally, they evaluate the multilingual generalization of BioMistral by translating the benchmark into 7 other languages, marking the first large-scale multilingual evaluation of LLMs in the medical domain. All datasets, multilingual benchmarks, scripts, and models are freely released under the Apache 2.0 license.
The paper highlights the challenges and opportunities of integrating LLMs into healthcare, emphasizing the need for specialized models that can run on consumer-grade devices while maintaining strong performance. BioMistral 7B, the focus of the study, is further pre-trained on a corpus drawn from the PMC Open Access Subset, optimized for training efficiency, and evaluated through few-shot in-context learning and supervised fine-tuning. The authors also explore model merging techniques and quantization strategies to boost performance and reduce resource requirements, as sketched below.
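To make the quantization point concrete, here is a minimal sketch of loading a BioMistral-style 7B model in 4-bit precision with Hugging Face transformers and bitsandbytes so it can fit on consumer-grade hardware. The repository id `BioMistral/BioMistral-7B` and the specific quantization settings are assumptions for illustration; the paper's released checkpoints and exact quantization methods may differ.

```python
# A minimal sketch (not the authors' release scripts) of loading a BioMistral-style
# model in 4-bit precision with Hugging Face transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "BioMistral/BioMistral-7B"  # assumed Hugging Face repo id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit to fit consumer GPUs
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers automatically across available devices
)

# Example generation with the quantized model.
inputs = tokenizer("Metformin is primarily used to treat", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```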
The evaluation protocol covers a benchmark of 10 medical QA tasks in English, their multilingual translations, instruction prompting, and supervised fine-tuning. The results show that BioMistral 7B outperforms other open-source medical models in both few-shot learning and supervised fine-tuning scenarios, with significant gains in accuracy. Model merging strategies further enhance performance, and quantization improves efficiency without a significant loss in accuracy. The study also assesses calibration and truthfulness, finding that BioMistral 7B is better calibrated and more truthful than comparable models.
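As an illustration of the few-shot in-context evaluation setup, the sketch below builds a prompt for a multiple-choice medical QA item by prepending answered examples to the test question. The prompt template and the example items are hypothetical and only illustrate the general format, not the paper's exact template.

```python
# A hypothetical few-shot prompt builder for a multiple-choice medical QA item.
# The template and example questions are illustrative, not the paper's exact format.

FEW_SHOT_EXAMPLES = [
    {
        "question": "Which vitamin deficiency causes scurvy?",
        "options": {"A": "Vitamin A", "B": "Vitamin B12", "C": "Vitamin C", "D": "Vitamin D"},
        "answer": "C",
    },
    {
        "question": "Which organ produces insulin?",
        "options": {"A": "Liver", "B": "Pancreas", "C": "Kidney", "D": "Spleen"},
        "answer": "B",
    },
]


def format_item(question, options, answer=None):
    """Render one QA item; leave the answer blank for the test question."""
    lines = [f"Question: {question}"]
    lines += [f"{key}. {text}" for key, text in options.items()]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_few_shot_prompt(question, options):
    """Concatenate answered in-context examples followed by the unanswered test item."""
    blocks = [format_item(ex["question"], ex["options"], ex["answer"]) for ex in FEW_SHOT_EXAMPLES]
    blocks.append(format_item(question, options))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        "Which neurotransmitter is primarily deficient in Parkinson's disease?",
        {"A": "Serotonin", "B": "Dopamine", "C": "GABA", "D": "Acetylcholine"},
    )
    print(prompt)  # the model's next-token prediction (A/B/C/D) gives the answer
```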
The paper concludes by discussing future work, including human evaluation of generation quality, enhancing multilingual and chat capabilities, and improving calibration and reliability. The authors acknowledge the substantial computational resources required for the study and the ethical considerations in using such models for medical applications.