2024 | Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, Jian Li
This study investigates the effectiveness of prompt engineering in improving the consistency and reliability of large language models (LLMs) when answering medical questions, evaluated against the American Academy of Orthopaedic Surgeons (AAOS) evidence-based guideline for osteoarthritis (OA). The research compares different prompt styles across several LLMs to determine how well their answers align with the clinical guideline. The results show that the gpt-4-Web model with ROT prompting achieved the highest overall consistency (62.9%) and the highest consistency on strong recommendations (77.5%). Reliability varied across models, however, with some producing inconsistent answers across repeated runs. The study highlights the importance of prompt engineering in enhancing the accuracy of LLM responses to professional medical questions, and it emphasizes the need for further research to optimize LLM performance in clinical medicine, considering factors such as model architecture, parameters, and training data. The authors conclude that appropriate prompts can significantly improve the accuracy of LLM responses to medical questions, and that future work should focus on developing prompts tailored to medical applications. Overall, the findings suggest that prompt engineering is a crucial tool for improving the reliability and consistency of LLMs in clinical settings.
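The consistency metric described above can be illustrated with a minimal sketch: each question is posed to a model multiple times, the majority answer is compared against the guideline recommendation, and the match rate is reported. All function names and data here are illustrative assumptions, not the study's actual evaluation code.

```python
# Illustrative sketch only: a simple majority-vote consistency metric,
# assuming each question is asked several times and compared to the
# guideline's recommendation. Not the paper's actual methodology code.
from collections import Counter

def consistency_rate(llm_answers, guideline_labels):
    """Fraction of questions where the model's majority answer over
    repeated runs matches the guideline recommendation."""
    assert len(llm_answers) == len(guideline_labels)
    matches = 0
    for runs, label in zip(llm_answers, guideline_labels):
        # Majority answer across repeated runs of the same question.
        majority, _ = Counter(runs).most_common(1)[0]
        if majority == label:
            matches += 1
    return matches / len(guideline_labels)

# Hypothetical example: three questions, each asked five times.
answers = [
    ["agree", "agree", "disagree", "agree", "agree"],
    ["disagree", "disagree", "disagree", "agree", "disagree"],
    ["agree", "disagree", "agree", "disagree", "agree"],
]
guidelines = ["agree", "agree", "agree"]
print(f"{consistency_rate(answers, guidelines):.1%}")  # 2 of 3 match -> 66.7%
```

A metric of this shape also captures the reliability concern the abstract raises: a model whose repeated runs disagree with each other will produce unstable majority answers and a lower score.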