13 Feb 2024 | Ryan Liu, Theodore R. Sumers, Ishita Dasgupta, Thomas L. Griffiths
The paper explores how large language models (LLMs) navigate the trade-off between honesty and helpfulness in conversational settings. Using psychological models and experiments originally designed to characterize human behavior, the study analyzes a range of LLMs and examines how optimization for human preferences or inference-time reasoning affects this trade-off. The findings reveal that reinforcement learning from human feedback improves both honesty and helpfulness, while chain-of-thought (CoT) prompting skews LLMs towards helpfulness over honesty. Notably, GPT-4 Turbo demonstrates human-like response patterns, including sensitivity to conversational framing and to the listener's decision context. The study suggests that even abstract conversational values can be steered by zero-shot prompting, highlighting the importance of understanding and steering these trade-offs for the safe and effective deployment of conversational agents.
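To make the zero-shot steering idea concrete, here is a minimal sketch assuming the OpenAI Python client (openai>=1.0); the system prompts and the example question are illustrative stand-ins, not the paper's exact wording or stimuli.

```python
# Sketch of zero-shot value steering: the same question is asked under two
# different system prompts, one emphasizing honesty and one emphasizing
# helpfulness. Prompts and question text are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def ask(system_prompt: str, question: str, model: str = "gpt-4-turbo") -> str:
    """Query the model with a value-steering system prompt."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

question = "My friend baked this cake for me. Honestly, how does it look?"

# Steer toward honesty: report the truth even when it is unwelcome.
honest_reply = ask(
    "Prioritize being truthful, even when the truth is not what the "
    "listener hopes to hear.",
    question,
)

# Steer toward helpfulness: prioritize the listener's goals and feelings.
helpful_reply = ask(
    "Prioritize being helpful and supportive to the listener, even at "
    "some cost to strict accuracy.",
    question,
)
```

Comparing the two replies on the same prompt gives a rough, qualitative sense of how strongly a given model's default behavior can be pushed toward one value or the other with nothing more than an instruction.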