This paper investigates the impact of Chain-of-Thought (CoT) prompting on gender bias in large language models (LLMs) for unscalable tasks. The study constructs a benchmark called Multi-step Gender Bias Reasoning (MGBR) to evaluate how LLMs handle gender-related biases in tasks that involve counting feminine and masculine words. The benchmark contains lists of words with gendered and occupational associations, and asks LLMs to count the number of feminine or masculine words in each list. The study finds that without step-by-step reasoning, most LLMs make socially biased predictions even on this simple task. When prompted with CoT, however, LLMs reduce this bias and make fairer predictions.
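To make the counting task concrete, the sketch below builds a single MGBR-style query and its reference answer. The word lists, prompt wording, and list contents here are illustrative assumptions, not the benchmark's actual data or templates.

```python
# Minimal sketch of an MGBR-style counting query.
# The gendered word sets and prompt text are illustrative assumptions only.

FEMININE_WORDS = {"she", "her", "mother", "actress", "queen"}   # hypothetical list
MASCULINE_WORDS = {"he", "his", "father", "actor", "king"}      # hypothetical list

def build_counting_prompt(words, target_gender="feminine"):
    """Ask the model to count words of one gender in a given list."""
    word_list = ", ".join(words)
    return (
        f"Here is a list of words: {word_list}.\n"
        f"How many {target_gender} words are in the list? Answer with a number."
    )

def gold_count(words, target_gender="feminine"):
    """Reference answer: count list items belonging to the target gender set."""
    gendered = FEMININE_WORDS if target_gender == "feminine" else MASCULINE_WORDS
    return sum(1 for w in words if w.lower() in gendered)

words = ["she", "doctor", "father", "nurse", "queen"]
print(build_counting_prompt(words))
# Occupational words such as "doctor" or "nurse" carry stereotypical associations
# but are not gendered, so they should not be counted.
print("gold answer:", gold_count(words))
```

A biased model might count stereotypically feminine occupations such as "nurse" toward the feminine total, which is exactly the failure mode the benchmark is designed to expose.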
The study compares different prompting strategies, including zero-shot, few-shot, CoT, and debiasing prompts. Results show that CoT substantially reduces gender bias in LLMs, because it encourages the models to explicitly consider the gender of each word in the list. The study also finds that MGBR correlates strongly with existing extrinsic bias evaluation benchmarks for LLMs, such as the Bias Benchmark for QA (BBQ) and Bias Benchmark for Natural Language Inference (BNLI), but weakly with intrinsic bias evaluation metrics like CrowS-Pairs (CP) and StereoSet (SS). This suggests that MGBR captures biases that manifest in downstream tasks.
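The contrast between the prompting conditions can be illustrated with minimal templates like the ones below. These are hypothetical stand-ins, not the study's exact instructions.

```python
# Illustrative prompt templates for the compared conditions.
# The wording of each template is an assumption, not the paper's exact phrasing.

def zero_shot(question):
    """Ask the question directly, with no demonstrations or reasoning cue."""
    return question

def few_shot(question, examples):
    """Prepend (question, answer) demonstrations before the target question."""
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{demos}\nQ: {question}\nA:"

def chain_of_thought(question):
    """Nudge the model to label the gender of each word before counting."""
    return f"{question}\nLet's think step by step, checking the gender of each word."

def debias_prompt(question):
    """Prepend an explicit instruction not to rely on gender stereotypes."""
    return "Do not rely on gender stereotypes when answering.\n" + question
```

Under this framing, CoT differs from the debiasing prompt in that it changes how the model works through the list rather than merely instructing it to be unbiased.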
The study also explores the effectiveness of CoT in debiasing LLMs on other benchmark tasks, such as BBQ and BNLI. Results show that CoT is more effective than debiasing prompts at reducing gender bias on these tasks. The study concludes that CoT can be an effective method for mitigating gender bias in LLMs, particularly in unscalable tasks, and suggests that future research extend CoT-based debiasing to non-binary genders and other types of social biases.