Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models

2024 | Honghao Lai, MM; Long Ge, MD; Mingyao Sun, MSN; Bei Pan, MD; Jiajie Huang, MSN; Liangying Hou, MD; Qiuyu Yang, MD; Jiayi Liu, MM; Jianing Liu, MSN; Ziyeng Ye, MM; Danni Xia, MM; Weilong Zhao, MM; Xiaomian Wang, MD; Ming Liu, MD; Jhalok Ronjan Talukdar, PhD; Jinhui Tian, MD; Kehu Yang, MD; Janne Estill, PhD
This study explores the feasibility and reliability of using large language models (LLMs) to assess the risk of bias (ROB) in randomized clinical trials (RCTs). A survey study was conducted from August 10 to October 30, 2023, involving 30 RCTs selected from published systematic reviews. Two LLMs, ChatGPT (LLM 1) and Claude (LLM 2), were guided by a structured prompt to assess ROB using a modified version of the Cochrane ROB tool. Their results were compared with assessments by three experts, which served as the criterion standard. Both LLMs demonstrated high accuracy: LLM 2 achieved a significantly higher correct assessment rate (89.5%) than LLM 1 (84.5%). Domain-specific accuracy varied, with sensitivity below 0.80 in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns). Consistency was also high, with Cohen κ exceeding 0.80 in most domains for both models. The mean assessment time was significantly shorter for LLM 2 (53 seconds) than for LLM 1 (77 seconds). The study concludes that LLMs can be effective tools for assessing ROB in systematic reviews, offering substantial accuracy and efficiency.
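To make the workflow concrete, below is a minimal Python sketch of how a structured prompt might drive an LLM through a domain-by-domain ROB assessment. It is not the authors' actual prompt: the `call_llm` helper, the prompt wording, and the JSON response format are illustrative assumptions, and since only domains 1, 2, and 6 are named in the abstract, the remaining domain labels follow the standard Cochrane ROB tool and may differ from the authors' modified version.

```python
import json

# Hypothetical helper: wraps whichever LLM API is in use (e.g., ChatGPT or
# Claude, as in the study). Not a real library call; plug in your own client.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("connect this to an LLM client")

# Six domains of a modified Cochrane ROB tool. Only domains 1, 2, and 6 are
# named in the abstract; domains 3-5 are assumed from the standard tool.
ROB_DOMAINS = [
    "random sequence generation",
    "allocation concealment",
    "blinding of participants and personnel",
    "blinding of outcome assessment",
    "incomplete outcome data",
    "other concerns",
]

# Illustrative structured prompt: one judgment per domain plus a supporting
# quote, returned as JSON so answers can be scored against the experts.
PROMPT_TEMPLATE = """You are assessing the risk of bias of a randomized
clinical trial using a modified Cochrane ROB tool. For each domain listed
below, answer "low risk" or "high risk" and quote the sentence from the
trial report that supports your judgment.

Domains: {domains}

Trial report:
{report}

Respond with a JSON object mapping each domain to
{{"judgment": ..., "support": ...}}."""

def assess_rob(report_text: str) -> dict:
    """Run one ROB assessment and parse the model's JSON answer."""
    prompt = PROMPT_TEMPLATE.format(
        domains="; ".join(ROB_DOMAINS), report=report_text
    )
    return json.loads(call_llm(prompt))
```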
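The accuracy, sensitivity, and Cohen κ statistics reported above can be computed as sketched below. The labels are toy values rather than study data, the judgments are assumed to be binary (low/high risk), and `cohen_kappa_score` and `confusion_matrix` come from scikit-learn. Cohen κ corrects observed agreement for chance agreement, κ = (p_o − p_e) / (1 − p_e).

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Toy labels for one domain: expert consensus (criterion standard) versus an
# LLM's judgments across trials. 1 = high risk, 0 = low risk. Illustrative
# values only, not data from the study.
experts = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
llm     = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

# Correct assessment rate: share of trials where the LLM matched the experts.
accuracy = sum(e == l for e, l in zip(experts, llm)) / len(experts)

# Sensitivity: of the trials the experts rated high risk, the fraction the
# LLM also flagged (true positives / all expert positives).
tn, fp, fn, tp = confusion_matrix(experts, llm).ravel()
sensitivity = tp / (tp + fn)

# Cohen kappa: agreement between two sets of ratings corrected for chance.
kappa = cohen_kappa_score(experts, llm)

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} kappa={kappa:.2f}")
```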