1 Jan 2024 | Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, Tim Althoff
A computational framework for the behavioral assessment of LLM therapists is introduced. The paper does not advocate for using large language models (LLMs) in therapeutic settings, nor does it establish their readiness; instead, it aims to systematically characterize and assess how current LLMs behave when used for therapy. The emergence of LLMs such as ChatGPT has increased interest in their potential as therapists, but understanding of their behavior remains limited because systematic studies have been lacking.

The paper proposes BOLT, a novel computational framework for studying the conversational behavior of LLMs when used as therapists. BOLT uses two publicly available, annotated therapy conversation datasets to simulate conversations between LLM therapists and simulated clients, and it develops an in-context learning method that quantitatively measures LLM behavior by identifying which of 13 psychotherapy techniques appear in each therapist utterance. The framework then compares LLM therapist behavior against both high- and low-quality human therapy and studies how that behavior can be modulated to better reflect high-quality care.
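To make the measurement step concrete, here is a minimal sketch of what such an in-context learning classifier could look like: a chat model is shown a few labeled examples and asked to tag a therapist utterance with the techniques it exhibits. The technique names, the few-shot examples, and the classify_techniques helper are illustrative assumptions, not the paper's actual prompt or taxonomy.

```python
# Hypothetical sketch of an in-context learning classifier for
# psychotherapy techniques (not the paper's actual prompt or label set).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A subset of technique labels; the paper defines 13 such techniques.
TECHNIQUES = [
    "Reflection of emotion",
    "Problem-solving advice",
    "Psychoeducation",
    "Normalization",
]

# Illustrative few-shot examples pairing an utterance with its labels.
FEW_SHOT_EXAMPLES = [
    ("It sounds like you felt really alone after the move.",
     ["Reflection of emotion"]),
    ("Many people feel this way when starting a new job.",
     ["Normalization"]),
]

def classify_techniques(utterance: str) -> list[str]:
    """Ask the model which techniques a therapist utterance exhibits."""
    shots = "\n".join(
        f'Utterance: "{u}"\nTechniques: {", ".join(labels)}'
        for u, labels in FEW_SHOT_EXAMPLES
    )
    prompt = (
        "Label each therapist utterance with the psychotherapy "
        f"techniques it uses, chosen from: {', '.join(TECHNIQUES)}.\n\n"
        f"{shots}\n\nUtterance: \"{utterance}\"\nTechniques:"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content
    # Parse the free-text answer back into the fixed label set.
    return [t for t in TECHNIQUES if t.lower() in answer.lower()]

print(classify_techniques("Have you tried making a to-do list each morning?"))
```

Restricting the model to a fixed label set and parsing its answer back into that set is what makes the behavioral measurement quantitative rather than anecdotal.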
Analyses of GPT and Llama variants reveal that these LLM therapists often resemble behaviors more common in low-quality than in high-quality human therapy. For example, when clients share emotions, the models tend to respond with problem-solving advice, prioritizing problem-solving over reflection and normalization in a way that can be undesirable in therapy; they also respond with more psychoeducation and normalization, a pattern similar to low-quality therapy. On the positive side, the models reflect on clients' needs and strengths more than low-quality therapy does, and they demonstrate the ability to reflect on clients' emotions and experiences, a key component of high-quality therapy.

Overall, the study concludes that although LLMs generate responses that resemble those of human therapists, their behavior is not yet fully consistent with high-quality care. By combining behavioral and quality assessments, BOLT provides a systematic way to evaluate LLM therapists against human therapy and underscores the need for further research before LLM-based therapy can be relied upon.
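The comparison behind these findings can be pictured as contrasting technique frequencies across three sets of conversations. The sketch below uses an assumed utterance schema and a simple per-utterance frequency metric, not necessarily the paper's statistics, to illustrate the idea.

```python
# Hypothetical sketch of BOLT-style behavior comparison: compute how often
# each psychotherapy technique appears in therapist utterances, then
# contrast LLM therapists with high- and low-quality human therapy.
from collections import Counter

def technique_frequencies(conversations: list[list[dict]]) -> dict[str, float]:
    """Fraction of therapist utterances exhibiting each technique.

    Each turn is assumed to look like:
    {"speaker": "therapist", "techniques": ["Reflection of emotion", ...]}
    """
    counts: Counter[str] = Counter()
    total = 0
    for conv in conversations:
        for turn in conv:
            if turn["speaker"] != "therapist":
                continue
            total += 1
            counts.update(turn["techniques"])
    return {t: c / total for t, c in counts.items()} if total else {}

def compare(llm_convs, high_quality_convs, low_quality_convs):
    """Print per-technique frequencies side by side for the three sets."""
    llm = technique_frequencies(llm_convs)
    high = technique_frequencies(high_quality_convs)
    low = technique_frequencies(low_quality_convs)
    for technique in sorted(set(llm) | set(high) | set(low)):
        print(
            f"{technique:28s} "
            f"LLM={llm.get(technique, 0):.2f} "
            f"high={high.get(technique, 0):.2f} "
            f"low={low.get(technique, 0):.2f}"
        )
```

Under this framing, a finding such as "more problem-solving advice than high-quality therapy" corresponds to the LLM column exceeding the high-quality column for that technique.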