Measuring Implicit Bias in Explicitly Unbiased Large Language Models

23 May 2024 | Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, Thomas L. Griffiths
The paper "Measuring Implicit Bias in Explicitly Unbiased Large Language Models" by Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths addresses the challenge of measuring implicit biases in large language models (LLMs) that appear unbiased on explicit social bias tests. The authors introduce two new measures: LLM Implicit Bias and LLM Decision Bias. LLM Implicit Bias is a prompt-based method adapted from the Implicit Association Test (IAT) to reveal implicit biases, while LLM Decision Bias is designed to detect subtle discrimination in decision-making tasks. These measures are based on psychological research and aim to capture the nuanced biases that may affect downstream behaviors. The study found pervasive stereotype biases in 8 value-aligned LLMs across 4 social categories (race, gender, religion, health) and 21 stereotypes. Despite the models' apparent lack of bias on existing benchmarks, the proposed measures revealed significant implicit biases, such as racial and gender stereotypes. The LLM Implicit Bias measure correlates with existing language model embedding-based bias methods but better predicts downstream behaviors measured by LLM Decision Bias. The paper also discusses the limitations and implications of these findings, emphasizing the importance of understanding and addressing implicit biases in LLMs to ensure fair and equitable decision-making. The authors conclude that their approach demonstrates how psychology can inspire new methods for assessing LLMs and highlights the need for further research to understand and mitigate these biases.The paper "Measuring Implicit Bias in Explicitly Unbiased Large Language Models" by Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths addresses the challenge of measuring implicit biases in large language models (LLMs) that appear unbiased on explicit social bias tests. The authors introduce two new measures: LLM Implicit Bias and LLM Decision Bias. LLM Implicit Bias is a prompt-based method adapted from the Implicit Association Test (IAT) to reveal implicit biases, while LLM Decision Bias is designed to detect subtle discrimination in decision-making tasks. These measures are based on psychological research and aim to capture the nuanced biases that may affect downstream behaviors. The study found pervasive stereotype biases in 8 value-aligned LLMs across 4 social categories (race, gender, religion, health) and 21 stereotypes. Despite the models' apparent lack of bias on existing benchmarks, the proposed measures revealed significant implicit biases, such as racial and gender stereotypes. The LLM Implicit Bias measure correlates with existing language model embedding-based bias methods but better predicts downstream behaviors measured by LLM Decision Bias. The paper also discusses the limitations and implications of these findings, emphasizing the importance of understanding and addressing implicit biases in LLMs to ensure fair and equitable decision-making. The authors conclude that their approach demonstrates how psychology can inspire new methods for assessing LLMs and highlights the need for further research to understand and mitigate these biases.