How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?

How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?

2024 | Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, Thomas L. Griffiths
Large Language Models (LLMs) navigate the balance between honesty and helpfulness in conversations by aligning with human conversational values. This study explores how LLMs handle trade-offs between these two key values, using psychological models and experiments to analyze their behavior. The research finds that reinforcement learning from human feedback (RLHF) improves both honesty and helpfulness, while chain-of-thought prompting (CoT) skews LLMs toward helpfulness at the expense of honesty. GPT-4 Turbo demonstrates human-like response patterns, including sensitivity to conversational framing and listener context. The study reveals that LLMs internalize conversational values and can be steered by prompts, suggesting that these values can be adjusted through zero-shot prompting. The findings highlight the importance of understanding and guiding these trade-offs in LLMs to ensure they align with human values in real-world applications. The research also formalizes helpfulness and honesty using Gricean maxims and explores how different training and prompting methods influence these values. The results show that LLMs can be steered to prioritize honesty or helpfulness depending on the context, and that models like GPT-4 Turbo exhibit human-like trade-offs between these values. The study contributes to the understanding of how LLMs can be aligned with human conversational norms and provides insights into how these values can be manipulated through prompting.Large Language Models (LLMs) navigate the balance between honesty and helpfulness in conversations by aligning with human conversational values. This study explores how LLMs handle trade-offs between these two key values, using psychological models and experiments to analyze their behavior. The research finds that reinforcement learning from human feedback (RLHF) improves both honesty and helpfulness, while chain-of-thought prompting (CoT) skews LLMs toward helpfulness at the expense of honesty. GPT-4 Turbo demonstrates human-like response patterns, including sensitivity to conversational framing and listener context. The study reveals that LLMs internalize conversational values and can be steered by prompts, suggesting that these values can be adjusted through zero-shot prompting. The findings highlight the importance of understanding and guiding these trade-offs in LLMs to ensure they align with human values in real-world applications. The research also formalizes helpfulness and honesty using Gricean maxims and explores how different training and prompting methods influence these values. The results show that LLMs can be steered to prioritize honesty or helpfulness depending on the context, and that models like GPT-4 Turbo exhibit human-like trade-offs between these values. The study contributes to the understanding of how LLMs can be aligned with human conversational norms and provides insights into how these values can be manipulated through prompting.
Reach us at info@study.space
[slides and audio] How do Large Language Models Navigate Conflicts between Honesty and Helpfulness%3F