Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways


17 Jun 2024 | Shubham Atreja, Joshua Ashkinaze, Lingyao Li, Julia Mendelsohn, Libby Hemphill
This study investigates how prompt design affects the compliance and accuracy of large language models (LLMs) in generating annotations for computational social science (CSS) tasks. The research focuses on four CSS tasks: toxicity, sentiment, rumor stance, and news frames. The study uses three LLMs (ChatGPT, PaLM2, and Falcon7b) and evaluates how different prompt features, such as definition inclusion, output type (label or numerical score), explanation, and prompt length, affect the quality and distribution of LLM-generated annotations.

The results show that LLM compliance and accuracy are highly dependent on prompt design. For example, prompting for numerical scores instead of labels reduces compliance and accuracy for most LLMs. Prompting with definitions improves ChatGPT's accuracy without reducing its compliance, but reduces compliance for PaLM2 and Falcon7b. Concise prompts can reduce annotation costs but may hurt accuracy or compliance, depending on the task and model. Prompting LLMs to explain their output increases compliance but changes the distribution of generated labels.

The study highlights the importance of careful prompt design in CSS research, since different prompt strategies can produce different annotation distributions, which may in turn affect research outcomes. While some prompt designs improve accuracy, others may introduce biases or inconsistencies, so researchers need to weigh the trade-offs between cost, accuracy, and compliance when designing prompts. Overall, the research provides a practical guide for designing effective prompts for LLMs in CSS tasks.
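To make the prompt-feature manipulations concrete, the sketch below shows one way to generate prompt variants for a single toxicity-annotation item by toggling the features the paper studies: definition inclusion, label versus numerical-score output, and a request for an explanation. The `build_prompt` helper, the definition wording, the example item, and the commented `query_llm` call are illustrative assumptions, not the authors' exact prompts; the actual request would go through whichever model API (ChatGPT, PaLM2, or Falcon7b) the researcher uses.

```python
from itertools import product

# Illustrative toxicity definition; the paper's exact wording may differ.
TOXICITY_DEFINITION = (
    "Toxicity: rude, disrespectful, or unreasonable language that is "
    "likely to make someone leave a discussion."
)

def build_prompt(text, include_definition, output_type, ask_explanation):
    """Assemble one prompt variant by toggling the studied prompt features.

    output_type: "label" asks for a toxic / not toxic label;
    "score" asks for a numerical rating between 0 and 1.
    """
    parts = []
    if include_definition:
        parts.append(TOXICITY_DEFINITION)
    parts.append(f'Text: "{text}"')
    if output_type == "label":
        parts.append("Is this text toxic? Answer with 'toxic' or 'not toxic'.")
    else:
        parts.append("Rate the toxicity of this text on a scale from 0 to 1.")
    if ask_explanation:
        parts.append("Briefly explain your answer.")
    return "\n".join(parts)

# Hypothetical item to annotate.
item = "You clearly have no idea what you're talking about."

# Enumerate all 2 x 2 x 2 = 8 prompt variants for this item.
for definition, output_type, explanation in product(
    [True, False], ["label", "score"], [True, False]
):
    prompt = build_prompt(item, definition, output_type, explanation)
    print(f"--- definition={definition}, output={output_type}, "
          f"explanation={explanation} ---")
    print(prompt, "\n")
    # response = query_llm(prompt)  # hypothetical call to the chosen LLM API
```

Compliance could then be checked by parsing each response for the requested label or score format, and accuracy computed against gold-standard annotations for the task.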