28 Feb 2024 | Julian Coda-Forno, Marcel Binz, Jane X. Wang, Eric Schulz
CogBench is a new benchmark that evaluates the behavior of large language models (LLMs) using metrics derived from seven cognitive psychology experiments. The benchmark comprises ten behavioral metrics and is applied to 35 LLMs, yielding a rich dataset for analysis. The study uses statistical multilevel modeling to account for dependencies among models.

Key findings: model size and reinforcement learning from human feedback (RLHF) are important for improving performance and aligning models with human behavior; open-source models are less risk-prone than proprietary ones; and fine-tuning on code does not necessarily improve behavior. Prompt-engineering techniques influence distinct behavioral characteristics: chain-of-thought (CoT) prompting improves probabilistic reasoning, while take-a-step-back (SB) prompting promotes model-based behaviors.

By focusing on behavioral metrics rather than task accuracy alone, CogBench offers a more comprehensive assessment of LLMs than traditional performance-based benchmarks and supports the development of more effective evaluation methods.
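To give a sense of the multilevel analysis, the sketch below fits a mixed-effects regression relating LLM features to a behavioral metric. This is a minimal illustration, not the authors' code: the data frame, its column names (behavioral_score, log_params, rlhf, open_source, model_family), and the choice of a random intercept per model family are all hypothetical stand-ins for the paper's setup.

```python
# Minimal sketch (assumed setup, not the CogBench implementation):
# a multilevel regression with a random intercept per model family,
# so that related models are not treated as independent observations.
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per evaluated LLM.
df = pd.DataFrame({
    "behavioral_score": [0.62, 0.71, 0.68, 0.55, 0.59, 0.80, 0.77, 0.74],
    "log_params":       [9.5, 10.8, 10.2, 9.2, 9.6, 11.3, 11.0, 10.9],  # log10 parameters
    "rlhf":             [0, 1, 1, 0, 0, 1, 1, 1],                       # RLHF fine-tuned?
    "open_source":      [1, 1, 1, 1, 1, 0, 0, 0],
    "model_family":     ["llama", "llama", "llama",
                         "falcon", "falcon",
                         "gpt", "gpt", "gpt"],
})

# Fixed effects for size, RLHF, and openness; random intercept grouped by
# model family to capture dependencies among fine-tuned variants.
model = smf.mixedlm(
    "behavioral_score ~ log_params + rlhf + open_source",
    data=df,
    groups=df["model_family"],
)
result = model.fit()
print(result.summary())
```

With real data, the fixed-effect coefficients would indicate how strongly size, RLHF, and open-source status predict each behavioral metric once model-family dependencies are absorbed by the random intercepts.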