Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks

This study introduces Meerkat, a new family of open-source medical AI systems with enhanced reasoning skills derived from medical textbooks. The models, ranging from 7 to 70 billion parameters, were trained on a synthetic dataset of high-quality chain-of-thought reasoning paths from 18 medical textbooks, together with diverse instruction-following datasets. Meerkat-7B achieved 77.1% accuracy on the MedQA benchmark, surpassing previous models such as MediTron and GPT-3.5, and was the first 7B model to exceed the USMLE passing threshold. Meerkat-70B outperformed GPT-4 by an average of 1.3% across six medical benchmarks and correctly diagnosed 21 of 38 complex clinical cases, outperforming humans (13.8) and closely matching GPT-4 (21.8).

The models also gave more detailed free-form responses to clinical queries than existing small models, approaching the performance of large commercial models. The study highlights Meerkat's effectiveness on complex medical challenges and shows that small models can reach performance comparable to large models through enhanced reasoning skills acquired from medical textbooks. It also emphasizes the need for further development toward more reliable AI systems for healthcare applications.

30 Jun 2024 | Hyunjae Kim¹, Hyeon Hwang¹, Jiwoo Lee¹, Sihyeon Park¹, Dain Kim¹, Taewhoo Lee¹, Chanwoong Yoon¹, Jiwoong Sohn¹, Donghee Choi², Jaewoo Kang¹,³*