This paper investigates biases in large language models (LLMs) when generating text based on personas, focusing on how well models can be steered towards different types of personas. The study defines an incongruous persona as one in which one trait makes the persona's other traits less likely according to human survey data, such as a political liberal who supports increased military spending. The research finds that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, often generating the stereotypical stance rather than the target stance. Models fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but show less diversity in their generated views.
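To make the notion of steerability concrete, here is a minimal sketch of how persona-steered generation and a steerability score might be set up. The prompt template and the `generate` and `expresses_stance` helpers are hypothetical placeholders, not the paper's actual pipeline.

```python
# Minimal sketch of persona-steered generation and a steerability estimate.
# The prompt template, `generate` backend, and `expresses_stance` classifier
# are hypothetical illustrations, not the paper's evaluation pipeline.
from typing import Callable, List, Tuple

def build_prompt(demographic: str, stance: str) -> str:
    """Compose a persona prompt that combines a demographic trait and a target stance."""
    return (
        f"You are a {demographic} who believes that {stance}. "
        "Write a short statement expressing your views on this issue."
    )

def steerability(
    personas: List[Tuple[str, str]],               # (demographic, stance) pairs
    generate: Callable[[str], str],                # any text-generation backend
    expresses_stance: Callable[[str, str], bool],  # stance judge (e.g. an LLM classifier)
    samples_per_persona: int = 10,
) -> float:
    """Fraction of generations that actually express the requested stance."""
    hits, total = 0, 0
    for demographic, stance in personas:
        prompt = build_prompt(demographic, stance)
        for _ in range(samples_per_persona):
            text = generate(prompt)
            hits += expresses_stance(text, stance)
            total += 1
    return hits / total if total else 0.0

# Comparing this score on congruous vs. incongruous persona sets would surface
# the kind of gap the paper reports (about 9.7% lower for incongruous personas).
```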
The study evaluates models across several tasks, including open-ended text generation and multiple-choice survey responses. LLMs prove significantly less steerable towards incongruous personas than congruous ones, while models fine-tuned with RLHF show higher steerability but narrower views. These results suggest that LLMs may struggle to represent diverse viewpoints, especially when personas are incongruous. The study also highlights that model behavior on multiple-choice tasks does not reliably predict steerability in open-ended generation.
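As an illustration of the multiple-choice versus open-ended comparison, the sketch below correlates a model's rate of picking the target survey option with its open-ended steerability per persona; the numbers are hypothetical placeholders, not results from the study.

```python
# Hedged sketch: contrasting a model's multiple-choice behavior with its
# open-ended steerability for the same set of personas.
from statistics import correlation  # Pearson correlation (Python 3.10+)

# One entry per persona: how often the model picks the target survey option in
# a multiple-choice setting, and how often its free-form generations are judged
# to express the target stance. All values are hypothetical.
mc_target_rate = [0.90, 0.85, 0.70, 0.40, 0.65, 0.55]
open_ended_steerability = [0.60, 0.90, 0.50, 0.75, 0.40, 0.80]

# A weak correlation would mirror the finding that multiple-choice behavior
# does not reliably predict steerability in open-ended generation.
print(f"Pearson r = {correlation(mc_target_rate, open_ended_steerability):.2f}")
```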
The research emphasizes the importance of evaluating models in open-ended text generation to uncover biases that may not be apparent in multiple-choice settings. It also shows that models may perpetuate stereotypes when generating text for incongruous personas, leading to reduced diversity in generated statements. The study concludes that while LLMs can be useful for persona-steered generation, there is still room for improvement in steerability towards a diverse range of personas and in generating nuanced representations of human opinions. The findings suggest that further research is needed to address these biases and improve the fairness and diversity of LLM outputs.