28 Feb 2024 | Julian Coda-Forno, Marcel Binz, Jane X. Wang, Eric Schulz
CogBench is a new benchmark that evaluates the behavior of large language models (LLMs) using metrics derived from seven cognitive psychology experiments. The benchmark comprises ten behavioral metrics and is applied to 35 LLMs, yielding a rich dataset for analysis. The study uses statistical multilevel modeling to account for dependencies among models.

Key findings: model size and reinforcement learning from human feedback (RLHF) are important for improving performance and aligning models with human behavior; open-source models are less risk-prone than proprietary ones; and fine-tuning on code does not necessarily improve behavior. Prompt-engineering techniques influence distinct behavioral characteristics: chain-of-thought (CoT) prompting improves probabilistic reasoning, while take-a-step-back (SB) prompting promotes model-based behaviors.

By focusing on behavioral metrics rather than task accuracy alone, CogBench offers a more comprehensive assessment of LLMs than traditional performance-based benchmarks and supports the development of more effective evaluation methods.
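To give a sense of the multilevel analysis, the sketch below fits a mixed-effects regression relating LLM features to a behavioral metric. This is a minimal illustration, not the authors' code: the data frame, its column names (behavioral_score, log_params, rlhf, open_source, model_family), and the choice of a random intercept per model family are all hypothetical stand-ins for the paper's setup.

```python
# Minimal sketch (assumed setup, not the CogBench implementation):
# a multilevel regression with a random intercept per model family,
# so that related models are not treated as independent observations.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per evaluated LLM.
df = pd.DataFrame({
    "behavioral_score": [0.62, 0.71, 0.68, 0.55, 0.59, 0.80, 0.77, 0.74],
    "log_params":       [9.5, 10.8, 10.2, 9.2, 9.6, 11.3, 11.0, 10.9],  # log10 parameters
    "rlhf":             [0, 1, 1, 0, 0, 1, 1, 1],                       # RLHF fine-tuned?
    "open_source":      [1, 1, 1, 1, 1, 0, 0, 0],
    "model_family":     ["llama", "llama", "llama",
                         "falcon", "falcon",
                         "gpt", "gpt", "gpt"],
})

# Fixed effects for size, RLHF, and openness; random intercept grouped by
# model family to capture dependencies among fine-tuned variants.
model = smf.mixedlm(
    "behavioral_score ~ log_params + rlhf + open_source",
    data=df,
    groups=df["model_family"],
)
result = model.fit()
print(result.summary())
```

With real data, the fixed-effect coefficients would indicate how strongly size, RLHF, and open-source status predict each behavioral metric once model-family dependencies are absorbed by the random intercepts.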