Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways


17 Jun 2024 | Shubham Atreja, Joshua Ashkinaze, Lingyao Li, Julia Mendelsohn, Libby Hemphill
This study investigates how prompt design affects the compliance and accuracy of large language models (LLMs) in generating annotations for computational social science (CSS) tasks. The research focuses on four CSS tasks: toxicity, sentiment, rumor stance, and news frames. The study uses three LLMs (ChatGPT, PaLM2, and Falcon7b) and evaluates how different prompt features, such as definition inclusion, output type (label or numerical score), explanation, and prompt length, affect the quality and distribution of LLM-generated annotations.

The results show that LLM compliance and accuracy are highly dependent on prompt design. For example, prompting for numerical scores instead of labels reduces compliance and accuracy for most LLMs. Prompting with definitions improves ChatGPT's accuracy without reducing its compliance, but reduces compliance for PaLM2 and Falcon7b. Concise prompts can reduce annotation costs but may hurt accuracy or compliance, depending on the task and model. Prompting LLMs to explain their output increases compliance but changes the distribution of generated labels.

The study highlights the importance of careful prompt design in CSS research, since different prompt strategies can produce different annotation distributions, which may in turn affect research outcomes. While some prompt designs improve accuracy, others may introduce biases or inconsistencies, so researchers need to weigh the trade-offs between cost, accuracy, and compliance when designing prompts. Overall, the research provides a practical guide for designing effective prompts for LLMs in CSS tasks.
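To make the prompt-feature manipulations concrete, the sketch below shows one way to generate prompt variants for a single toxicity-annotation item by toggling the features the paper studies: definition inclusion, label versus numerical-score output, and a request for an explanation. The `build_prompt` helper, the definition wording, the example item, and the commented `query_llm` call are illustrative assumptions, not the authors' exact prompts; the actual request would go through whichever model API (ChatGPT, PaLM2, or Falcon7b) the researcher uses.

```python
from itertools import product

# Illustrative toxicity definition; the paper's exact wording may differ.
TOXICITY_DEFINITION = (
    "Toxicity: rude, disrespectful, or unreasonable language that is "
    "likely to make someone leave a discussion."
)

def build_prompt(text, include_definition, output_type, ask_explanation):
    """Assemble one prompt variant by toggling the studied prompt features.

    output_type: "label" asks for a toxic / not toxic label;
    "score" asks for a numerical rating between 0 and 1.
    """
    parts = []
    if include_definition:
        parts.append(TOXICITY_DEFINITION)
    parts.append(f'Text: "{text}"')
    if output_type == "label":
        parts.append("Is this text toxic? Answer with 'toxic' or 'not toxic'.")
    else:
        parts.append("Rate the toxicity of this text on a scale from 0 to 1.")
    if ask_explanation:
        parts.append("Briefly explain your answer.")
    return "\n".join(parts)

# Hypothetical item to annotate.
item = "You clearly have no idea what you're talking about."

# Enumerate all 2 x 2 x 2 = 8 prompt variants for this item.
for definition, output_type, explanation in product(
    [True, False], ["label", "score"], [True, False]
):
    prompt = build_prompt(item, definition, output_type, explanation)
    print(f"--- definition={definition}, output={output_type}, "
          f"explanation={explanation} ---")
    print(prompt, "\n")
    # response = query_llm(prompt)  # hypothetical call to the chosen LLM API
```

Compliance could then be checked by parsing each response for the requested label or score format, and accuracy computed against gold-standard annotations for the task.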