Evaluating Large Language Model Biases in Persona-Steered Generation


30 May 2024 | Andy Liu, Mona Diab, Daniel Fried
The paper "Evaluating Large Language Model Biases in Persona-Steered Generation" by Andy Liu, Mona Diab, and Daniel Fried investigates the biases and steerability of large language models (LLMs) when generating text that reflects the views of multifaceted personas. The authors define an *incongruous persona* as one in which one trait makes another trait less likely according to human survey data, such as a politically liberal individual who supports increased military spending. They find that LLMs are 9.7% less steerable towards incongruous personas than towards congruous ones, often producing the stance stereotypically associated with the persona's demographic rather than the target stance. Models fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but this comes at the cost of reduced diversity in the views they express. The study also shows that an LLM's performance on multiple-choice survey questions does not predict its steerability in open-ended generation, highlighting the need for further research on representing diverse viewpoints. The results suggest that LLMs may perpetuate biases and caricatures when simulating complex personas, which could cause social harms and limit their usefulness in interactive applications.
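To make the steerability measurement concrete, the sketch below shows one way it could be operationalized: prompt a model with a persona, sample open-ended statements, classify each statement's stance on the target issue, and report the fraction that matches the persona's target stance. The `model.generate` call and the `classify_stance` helper are hypothetical stand-ins, not the paper's released code.

```python
from collections import Counter

def generate_statements(model, persona_prompt, n=30):
    """Sample n open-ended statements from `model` under a persona prompt.

    Assumes a hypothetical `model.generate(prompt) -> str` interface.
    """
    return [model.generate(persona_prompt) for _ in range(n)]

def classify_stance(statement, issue):
    """Return 'support', 'oppose', or 'neutral' for `statement` on `issue`.

    Placeholder for a stance classifier (e.g., an NLI- or LLM-based judge).
    """
    raise NotImplementedError

def steerability(model, persona, issue, target_stance, n=30):
    """Fraction of sampled generations expressing the persona's target stance."""
    prompt = (f"You are {persona}. Write a short statement expressing "
              f"your view on {issue}.")
    statements = generate_statements(model, prompt, n)
    stances = Counter(classify_stance(s, issue) for s in statements)
    return stances[target_stance] / n

# Example comparison of a congruous vs. an incongruous persona on one issue:
# congruous = "a politically conservative person who supports increased military spending"
# incongruous = "a politically liberal person who supports increased military spending"
# gap = (steerability(model, congruous, "military spending", "support")
#        - steerability(model, incongruous, "military spending", "support"))
```

Under this framing, the paper's headline finding corresponds to the gap between steerability on congruous and incongruous personas being positive (about 9.7 percentage points on average).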