Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

(2024) 7:41 | Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li & Jian Li
This study explores the application of prompt engineering in large language models (LLMs) to improve their reliability and consistency in answering medical questions, focusing on the osteoarthritis (OA) guidelines of the American Academy of Orthopaedic Surgeons (AAOS). Prompt engineering involves designing specific prompts that guide LLMs toward more accurate and consistent responses. The study used four types of prompts (IO, 0-COT, P-COT, and ROT) and tested them on nine different LLMs, including gpt-4-Web, gpt-3.5-ft-0, gpt-3.5-API-0, and Bard.

Key findings include:

- **Consistency**: gpt-4-Web with ROT prompting showed the highest overall consistency (62.9%) and the strongest performance on strong recommendations (77.5%).
- **Reliability**: The reliability of the different LLMs was not stable across prompts, with Fleiss kappa values ranging from −0.002 to 0.984.
- **Subgroup analysis**: ROT prompting performed better than the other prompts at the strong-recommendation level.
- **Invalid data**: Some responses were invalid, and predefined procedures were used to handle these data points.

The study highlights the importance of prompt engineering in improving the accuracy and reliability of LLMs in clinical medicine. Future research should focus on optimizing prompt-engineering techniques, developing specialized LLMs, and conducting real-time interactions with healthcare professionals and patients to further enhance the application of LLMs in healthcare.
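The Fleiss kappa values cited above measure agreement among repeated runs of a model on the same set of questions. A minimal sketch of how such a statistic can be computed (illustrative only, not the authors' code; the count matrix in the example is hypothetical):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for an N x k count matrix.

    ratings[i][j] = number of runs that assigned question i to answer
    category j (e.g. agree / disagree / invalid). Every row must sum to
    the same number of runs n. Returns a value in [-1, 1]; 1 is perfect
    agreement, 0 is chance-level agreement.
    """
    N = len(ratings)          # number of questions
    n = sum(ratings[0])       # number of runs per question
    k = len(ratings[0])       # number of answer categories

    # Observed per-question agreement, averaged over questions.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N

    # Expected agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    if P_e == 1.0:
        return 1.0
    return (P_bar - P_e) / (1 - P_e)


# Hypothetical example: 5 runs per question, 3 answer categories.
perfect = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
print(fleiss_kappa(perfect))  # perfect agreement -> 1.0
```

Values near 0, or slightly negative (like the −0.002 reported in the study), indicate that repeated runs agreed no better than chance, which is why unstable kappa values across prompts signal unreliable model behavior.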