Towards a Personal Health Large Language Model

June 11, 2024 | Justin Cosentino, Anastasiya Belyaeva, Xin Liu, Nicholas A. Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G. Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K. Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhotra, Leon Stern, Yossi Matias, Greg S. Corrado, Shwetak Patel, Shravya Shetty, Jiening Zhan, Shruthi Prabhakara, Daniel McDuff, and Cory Y. McLean
This paper introduces the Personal Health Large Language Model (PH-LLM), a version of Gemini fine-tuned for text understanding and reasoning over numerical time-series personal health data, with applications in sleep and fitness. PH-LLM is evaluated on three tasks: generating personalized coaching recommendations, answering multiple-choice exams that assess expert knowledge, and predicting patient-reported outcomes (PROs).

To support these evaluations, the authors created three novel benchmark datasets: 857 case studies in sleep and fitness, 629 multiple-choice questions covering sleep medicine and fitness, and 16 binary outcomes derived from survey responses in a large IRB-approved study. On the multiple-choice exams, PH-LLM scored 79% on sleep and 88% on fitness, both exceeding the average scores of a sample of human experts.
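A key ingredient of the coaching and exam tasks is rendering numerical wearable time series as text the model can reason over. The paper's actual prompt format is not reproduced here; the following is a minimal sketch, assuming hypothetical daily aggregate metrics, field names, and prompt wording.

```python
from datetime import date, timedelta

# Hypothetical daily aggregates from a wearable device; the field names and
# values are illustrative, not the paper's actual schema.
daily_metrics = [
    {"sleep_minutes": 412, "deep_sleep_minutes": 55, "resting_hr": 58, "steps": 9421},
    {"sleep_minutes": 365, "deep_sleep_minutes": 41, "resting_hr": 61, "steps": 4210},
    {"sleep_minutes": 447, "deep_sleep_minutes": 63, "resting_hr": 57, "steps": 11305},
]

def serialize_metrics(metrics, start: date) -> str:
    """Render numerical time-series data as plain text an LLM can reason over."""
    lines = []
    for i, day in enumerate(metrics):
        d = start + timedelta(days=i)
        lines.append(
            f"{d.isoformat()}: slept {day['sleep_minutes']} min "
            f"({day['deep_sleep_minutes']} min deep), "
            f"resting HR {day['resting_hr']} bpm, {day['steps']} steps"
        )
    return "\n".join(lines)

# Assemble a coaching-style prompt around the serialized data.
prompt = (
    "You are a sleep and fitness coach. Given the user's recent wearable data, "
    "provide personalized recommendations.\n\n"
    + serialize_metrics(daily_metrics, start=date(2024, 6, 1))
)
print(prompt)
```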
PH-LLM was additionally trained to predict self-reported sleep disruption and sleep impairment outcomes from both textual and multimodal-encoding representations of wearable sensor data (an illustrative sketch of the multimodal idea follows below). The results show that multimodal encoding is both necessary and sufficient to match the performance of a suite of discriminative models trained to predict these same outcomes.

Fine-tuning on sleep case studies produced statistically significant improvements in the model's use of relevant domain knowledge and in its personalization of sleep insights. In expert evaluations of case study responses, supplemented by qualitative interviews, PH-LLM performed comparably to human experts, and the experts found its responses helpful.
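The architecture of the paper's multimodal encoder is not detailed in this summary, so the sketch below is an assumption-laden illustration rather than the authors' implementation: a small recurrent encoder (hypothetical) maps a wearable sensor sequence to "soft tokens" in the language model's embedding space, which would be prepended to the text-prompt embeddings before the LLM forward pass.

```python
import torch
import torch.nn as nn

class SensorAdapter(nn.Module):
    """Illustrative multimodal adapter: projects a wearable sensor time series
    into soft tokens in an LLM's embedding space. All dimensions, the GRU
    encoder, and the soft-token count are hypothetical choices."""

    def __init__(self, n_features=8, d_model=2048, n_soft_tokens=4):
        super().__init__()
        self.encoder = nn.GRU(n_features, 256, batch_first=True)
        self.proj = nn.Linear(256, d_model * n_soft_tokens)
        self.n_soft_tokens = n_soft_tokens
        self.d_model = d_model

    def forward(self, sensor_seq):
        # sensor_seq: (batch, time, n_features)
        _, h = self.encoder(sensor_seq)           # h: (1, batch, 256)
        soft = self.proj(h.squeeze(0))            # (batch, d_model * n_soft_tokens)
        return soft.view(-1, self.n_soft_tokens, self.d_model)

adapter = SensorAdapter()
sensor_seq = torch.randn(2, 30 * 24, 8)  # e.g., 30 days of hourly features
soft_tokens = adapter(sensor_seq)
print(soft_tokens.shape)  # torch.Size([2, 4, 2048])

# In this sketch, a binary PRO (e.g., "reports disrupted sleep") could then be
# read out from the LLM's probabilities for yes/no answer tokens.
```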