BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text

March 2024 | Elliot Bolton, Abhinav Venigalla, Michihiro Yasunaga, David Hall, Betty Xiong, Tony Lee, Roxana Daneshjou, Jonathan Frankle, Percy Liang, Michael Carbin, Christopher D. Manning
**Abstract:** This paper introduces BioMedLM, a 2.7 billion parameter GPT-style autoregressive language model trained exclusively on PubMed abstracts and full articles. BioMedLM addresses the limitations of large, general-purpose models such as GPT-4 and Med-PaLM 2, which are computationally expensive, require internet access, and lack transparency about their training data. BioMedLM achieves strong performance on multiple-choice biomedical question-answering tasks, with scores competitive with much larger models, and it can generate useful answers to patient questions on medical topics. Its small size allows fine-tuning on a single GPU and inference on laptops, making it suitable for organizations with limited resources and strict privacy requirements. BioMedLM is available on the Hugging Face Hub, promoting transparency and reproducibility in biomedical NLP.

**Key Contributions:**

1. **Model Design:** BioMedLM is a GPT-2 style autoregressive model with a custom Byte-Pair Encoding (BPE) tokenizer, trained on PubMed abstracts and full articles.
2. **Performance:** BioMedLM achieves strong results on multiple-choice biomedical question answering, including 57.3% accuracy on MedMCQA (dev) and 69.0% accuracy on MMLU Medical Genetics.
3. **Generative Capabilities:** BioMedLM can produce multi-sentence answers to medical questions, demonstrating its potential for practical applications.
4. **Privacy and Cost:** The model's small size and ability to run on local hardware make it more cost-effective and privacy-preserving than large, closed models that must be accessed through external APIs.

**Related Work:** The paper reviews existing models and datasets, including GPT-Neo 2.7B, PubMedBERT, BioLinkBERT, and Galactica, highlighting the benefits of domain-specific training and the challenges of large-scale models.

**Conclusion:** BioMedLM demonstrates the potential of medium-sized, domain-specific models in biomedical NLP, offering a transparent, privacy-preserving, and economical solution for specific tasks. The model's availability on the Hugging Face Hub encourages further research and application in the field.
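The summary notes that BioMedLM can be fine-tuned on a single GPU, run on a laptop, and downloaded from the Hugging Face Hub. Below is a minimal sketch of loading the model with the `transformers` library and sampling a free-text answer to a patient-style question; the repository id `stanford-crfm/BioMedLM`, the prompt format, and the sampling settings are assumptions for illustration rather than details stated in this summary.

```python
# Minimal sketch: load BioMedLM from the Hugging Face Hub and sample an
# answer to a patient-style question. The repository id and prompt format
# are assumptions; the released checkpoint is a base language model, so
# its outputs are not medical advice.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stanford-crfm/BioMedLM"  # assumed Hub repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Question: What are common symptoms of iron-deficiency anemia?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# A 2.7B parameter model fits on a single modern GPU; with reduced
# precision or CPU inference it can also run on a well-equipped laptop.
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```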
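For the multiple-choice results (MedMCQA, MMLU Medical Genetics), one common way to use a causal language model as an answer selector is to score each option by the log-likelihood the model assigns to it given the question. The sketch below illustrates that generic zero-shot scoring recipe; the paper's reported numbers come from fine-tuned models, so this is an illustrative approximation, not the authors' protocol.

```python
# Generic sketch of multiple-choice scoring with a causal LM: pick the
# option whose tokens receive the highest total log-probability given the
# question. This is a common zero-shot recipe, not the fine-tuning setup
# behind the paper's MedMCQA / MMLU numbers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stanford-crfm/BioMedLM"  # assumed Hub repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()


def option_log_likelihood(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option tokens."""
    prompt = f"Question: {question}\nAnswer:"
    # Assumes the prompt tokenization is a prefix of the full tokenization,
    # which holds for typical BPE tokenizers when the option adds " <text>".
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict the token at position t + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = prompt_len - 1  # index of the first predicted option token
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()


question = "Which vitamin deficiency causes scurvy?"
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]
scores = [option_log_likelihood(question, opt) for opt in options]
print("Predicted answer:", options[scores.index(max(scores))])
```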